diff --git a/.gitignore b/.gitignore deleted file mode 100644 index 201806e..0000000 --- a/.gitignore +++ /dev/null @@ -1,16 +0,0 @@ -.eggs/ -/sgloader/__pycache__/ -/dataset/ -*.pyc -/.coverage -/scratch/swhgitloader.cProfile -/scratch/swhgitloader.profile -/scratch/save.p -*.egg-info -version.txt -/resources/repo-linux-to-load.ini -/resources/repo-to-load.ini -build/ -dist/ -.hypothesis -.pytest_cache diff --git a/AUTHORS b/AUTHORS deleted file mode 100644 index 2d0a34a..0000000 --- a/AUTHORS +++ /dev/null @@ -1,3 +0,0 @@ -Copyright (C) 2015 The Software Heritage developers - -See http://www.softwareheritage.org/ for more information. diff --git a/LICENSE b/LICENSE deleted file mode 100644 index 94a9ed0..0000000 --- a/LICENSE +++ /dev/null @@ -1,674 +0,0 @@ - GNU GENERAL PUBLIC LICENSE - Version 3, 29 June 2007 - - Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/> - Everyone is permitted to copy and distribute verbatim copies - of this license document, but changing it is not allowed. - - Preamble - - The GNU General Public License is a free, copyleft license for -software and other kinds of works. - - The licenses for most software and other practical works are designed -to take away your freedom to share and change the works. By contrast, -the GNU General Public License is intended to guarantee your freedom to -share and change all versions of a program--to make sure it remains free -software for all its users. We, the Free Software Foundation, use the -GNU General Public License for most of our software; it applies also to -any other work released this way by its authors. You can apply it to -your programs, too. - - When we speak of free software, we are referring to freedom, not -price. Our General Public Licenses are designed to make sure that you -have the freedom to distribute copies of free software (and charge for -them if you wish), that you receive source code or can get it if you -want it, that you can change the software or use pieces of it in new -free programs, and that you know you can do these things. - - To protect your rights, we need to prevent others from denying you -these rights or asking you to surrender the rights. Therefore, you have -certain responsibilities if you distribute copies of the software, or if -you modify it: responsibilities to respect the freedom of others. - - For example, if you distribute copies of such a program, whether -gratis or for a fee, you must pass on to the recipients the same -freedoms that you received. You must make sure that they, too, receive -or can get the source code. And you must show them these terms so they -know their rights. - - Developers that use the GNU GPL protect your rights with two steps: -(1) assert copyright on the software, and (2) offer you this License -giving you legal permission to copy, distribute and/or modify it. - - For the developers' and authors' protection, the GPL clearly explains -that there is no warranty for this free software. For both users' and -authors' sake, the GPL requires that modified versions be marked as -changed, so that their problems will not be attributed erroneously to -authors of previous versions. - - Some devices are designed to deny users access to install or run -modified versions of the software inside them, although the manufacturer -can do so. This is fundamentally incompatible with the aim of -protecting users' freedom to change the software.
The systematic -pattern of such abuse occurs in the area of products for individuals to -use, which is precisely where it is most unacceptable. Therefore, we -have designed this version of the GPL to prohibit the practice for those -products. If such problems arise substantially in other domains, we -stand ready to extend this provision to those domains in future versions -of the GPL, as needed to protect the freedom of users. - - Finally, every program is threatened constantly by software patents. -States should not allow patents to restrict development and use of -software on general-purpose computers, but in those that do, we wish to -avoid the special danger that patents applied to a free program could -make it effectively proprietary. To prevent this, the GPL assures that -patents cannot be used to render the program non-free. - - The precise terms and conditions for copying, distribution and -modification follow. - - TERMS AND CONDITIONS - - 0. Definitions. - - "This License" refers to version 3 of the GNU General Public License. - - "Copyright" also means copyright-like laws that apply to other kinds of -works, such as semiconductor masks. - - "The Program" refers to any copyrightable work licensed under this -License. Each licensee is addressed as "you". "Licensees" and -"recipients" may be individuals or organizations. - - To "modify" a work means to copy from or adapt all or part of the work -in a fashion requiring copyright permission, other than the making of an -exact copy. The resulting work is called a "modified version" of the -earlier work or a work "based on" the earlier work. - - A "covered work" means either the unmodified Program or a work based -on the Program. - - To "propagate" a work means to do anything with it that, without -permission, would make you directly or secondarily liable for -infringement under applicable copyright law, except executing it on a -computer or modifying a private copy. Propagation includes copying, -distribution (with or without modification), making available to the -public, and in some countries other activities as well. - - To "convey" a work means any kind of propagation that enables other -parties to make or receive copies. Mere interaction with a user through -a computer network, with no transfer of a copy, is not conveying. - - An interactive user interface displays "Appropriate Legal Notices" -to the extent that it includes a convenient and prominently visible -feature that (1) displays an appropriate copyright notice, and (2) -tells the user that there is no warranty for the work (except to the -extent that warranties are provided), that licensees may convey the -work under this License, and how to view a copy of this License. If -the interface presents a list of user commands or options, such as a -menu, a prominent item in the list meets this criterion. - - 1. Source Code. - - The "source code" for a work means the preferred form of the work -for making modifications to it. "Object code" means any non-source -form of a work. - - A "Standard Interface" means an interface that either is an official -standard defined by a recognized standards body, or, in the case of -interfaces specified for a particular programming language, one that -is widely used among developers working in that language. 
- - The "System Libraries" of an executable work include anything, other -than the work as a whole, that (a) is included in the normal form of -packaging a Major Component, but which is not part of that Major -Component, and (b) serves only to enable use of the work with that -Major Component, or to implement a Standard Interface for which an -implementation is available to the public in source code form. A -"Major Component", in this context, means a major essential component -(kernel, window system, and so on) of the specific operating system -(if any) on which the executable work runs, or a compiler used to -produce the work, or an object code interpreter used to run it. - - The "Corresponding Source" for a work in object code form means all -the source code needed to generate, install, and (for an executable -work) run the object code and to modify the work, including scripts to -control those activities. However, it does not include the work's -System Libraries, or general-purpose tools or generally available free -programs which are used unmodified in performing those activities but -which are not part of the work. For example, Corresponding Source -includes interface definition files associated with source files for -the work, and the source code for shared libraries and dynamically -linked subprograms that the work is specifically designed to require, -such as by intimate data communication or control flow between those -subprograms and other parts of the work. - - The Corresponding Source need not include anything that users -can regenerate automatically from other parts of the Corresponding -Source. - - The Corresponding Source for a work in source code form is that -same work. - - 2. Basic Permissions. - - All rights granted under this License are granted for the term of -copyright on the Program, and are irrevocable provided the stated -conditions are met. This License explicitly affirms your unlimited -permission to run the unmodified Program. The output from running a -covered work is covered by this License only if the output, given its -content, constitutes a covered work. This License acknowledges your -rights of fair use or other equivalent, as provided by copyright law. - - You may make, run and propagate covered works that you do not -convey, without conditions so long as your license otherwise remains -in force. You may convey covered works to others for the sole purpose -of having them make modifications exclusively for you, or provide you -with facilities for running those works, provided that you comply with -the terms of this License in conveying all material for which you do -not control copyright. Those thus making or running the covered works -for you must do so exclusively on your behalf, under your direction -and control, on terms that prohibit them from making any copies of -your copyrighted material outside their relationship with you. - - Conveying under any other circumstances is permitted solely under -the conditions stated below. Sublicensing is not allowed; section 10 -makes it unnecessary. - - 3. Protecting Users' Legal Rights From Anti-Circumvention Law. - - No covered work shall be deemed part of an effective technological -measure under any applicable law fulfilling obligations under article -11 of the WIPO copyright treaty adopted on 20 December 1996, or -similar laws prohibiting or restricting circumvention of such -measures. 
- - When you convey a covered work, you waive any legal power to forbid -circumvention of technological measures to the extent such circumvention -is effected by exercising rights under this License with respect to -the covered work, and you disclaim any intention to limit operation or -modification of the work as a means of enforcing, against the work's -users, your or third parties' legal rights to forbid circumvention of -technological measures. - - 4. Conveying Verbatim Copies. - - You may convey verbatim copies of the Program's source code as you -receive it, in any medium, provided that you conspicuously and -appropriately publish on each copy an appropriate copyright notice; -keep intact all notices stating that this License and any -non-permissive terms added in accord with section 7 apply to the code; -keep intact all notices of the absence of any warranty; and give all -recipients a copy of this License along with the Program. - - You may charge any price or no price for each copy that you convey, -and you may offer support or warranty protection for a fee. - - 5. Conveying Modified Source Versions. - - You may convey a work based on the Program, or the modifications to -produce it from the Program, in the form of source code under the -terms of section 4, provided that you also meet all of these conditions: - - a) The work must carry prominent notices stating that you modified - it, and giving a relevant date. - - b) The work must carry prominent notices stating that it is - released under this License and any conditions added under section - 7. This requirement modifies the requirement in section 4 to - "keep intact all notices". - - c) You must license the entire work, as a whole, under this - License to anyone who comes into possession of a copy. This - License will therefore apply, along with any applicable section 7 - additional terms, to the whole of the work, and all its parts, - regardless of how they are packaged. This License gives no - permission to license the work in any other way, but it does not - invalidate such permission if you have separately received it. - - d) If the work has interactive user interfaces, each must display - Appropriate Legal Notices; however, if the Program has interactive - interfaces that do not display Appropriate Legal Notices, your - work need not make them do so. - - A compilation of a covered work with other separate and independent -works, which are not by their nature extensions of the covered work, -and which are not combined with it such as to form a larger program, -in or on a volume of a storage or distribution medium, is called an -"aggregate" if the compilation and its resulting copyright are not -used to limit the access or legal rights of the compilation's users -beyond what the individual works permit. Inclusion of a covered work -in an aggregate does not cause this License to apply to the other -parts of the aggregate. - - 6. Conveying Non-Source Forms. - - You may convey a covered work in object code form under the terms -of sections 4 and 5, provided that you also convey the -machine-readable Corresponding Source under the terms of this License, -in one of these ways: - - a) Convey the object code in, or embodied in, a physical product - (including a physical distribution medium), accompanied by the - Corresponding Source fixed on a durable physical medium - customarily used for software interchange. 
- - b) Convey the object code in, or embodied in, a physical product - (including a physical distribution medium), accompanied by a - written offer, valid for at least three years and valid for as - long as you offer spare parts or customer support for that product - model, to give anyone who possesses the object code either (1) a - copy of the Corresponding Source for all the software in the - product that is covered by this License, on a durable physical - medium customarily used for software interchange, for a price no - more than your reasonable cost of physically performing this - conveying of source, or (2) access to copy the - Corresponding Source from a network server at no charge. - - c) Convey individual copies of the object code with a copy of the - written offer to provide the Corresponding Source. This - alternative is allowed only occasionally and noncommercially, and - only if you received the object code with such an offer, in accord - with subsection 6b. - - d) Convey the object code by offering access from a designated - place (gratis or for a charge), and offer equivalent access to the - Corresponding Source in the same way through the same place at no - further charge. You need not require recipients to copy the - Corresponding Source along with the object code. If the place to - copy the object code is a network server, the Corresponding Source - may be on a different server (operated by you or a third party) - that supports equivalent copying facilities, provided you maintain - clear directions next to the object code saying where to find the - Corresponding Source. Regardless of what server hosts the - Corresponding Source, you remain obligated to ensure that it is - available for as long as needed to satisfy these requirements. - - e) Convey the object code using peer-to-peer transmission, provided - you inform other peers where the object code and Corresponding - Source of the work are being offered to the general public at no - charge under subsection 6d. - - A separable portion of the object code, whose source code is excluded -from the Corresponding Source as a System Library, need not be -included in conveying the object code work. - - A "User Product" is either (1) a "consumer product", which means any -tangible personal property which is normally used for personal, family, -or household purposes, or (2) anything designed or sold for incorporation -into a dwelling. In determining whether a product is a consumer product, -doubtful cases shall be resolved in favor of coverage. For a particular -product received by a particular user, "normally used" refers to a -typical or common use of that class of product, regardless of the status -of the particular user or of the way in which the particular user -actually uses, or expects or is expected to use, the product. A product -is a consumer product regardless of whether the product has substantial -commercial, industrial or non-consumer uses, unless such uses represent -the only significant mode of use of the product. - - "Installation Information" for a User Product means any methods, -procedures, authorization keys, or other information required to install -and execute modified versions of a covered work in that User Product from -a modified version of its Corresponding Source. The information must -suffice to ensure that the continued functioning of the modified object -code is in no case prevented or interfered with solely because -modification has been made. 
- - If you convey an object code work under this section in, or with, or -specifically for use in, a User Product, and the conveying occurs as -part of a transaction in which the right of possession and use of the -User Product is transferred to the recipient in perpetuity or for a -fixed term (regardless of how the transaction is characterized), the -Corresponding Source conveyed under this section must be accompanied -by the Installation Information. But this requirement does not apply -if neither you nor any third party retains the ability to install -modified object code on the User Product (for example, the work has -been installed in ROM). - - The requirement to provide Installation Information does not include a -requirement to continue to provide support service, warranty, or updates -for a work that has been modified or installed by the recipient, or for -the User Product in which it has been modified or installed. Access to a -network may be denied when the modification itself materially and -adversely affects the operation of the network or violates the rules and -protocols for communication across the network. - - Corresponding Source conveyed, and Installation Information provided, -in accord with this section must be in a format that is publicly -documented (and with an implementation available to the public in -source code form), and must require no special password or key for -unpacking, reading or copying. - - 7. Additional Terms. - - "Additional permissions" are terms that supplement the terms of this -License by making exceptions from one or more of its conditions. -Additional permissions that are applicable to the entire Program shall -be treated as though they were included in this License, to the extent -that they are valid under applicable law. If additional permissions -apply only to part of the Program, that part may be used separately -under those permissions, but the entire Program remains governed by -this License without regard to the additional permissions. - - When you convey a copy of a covered work, you may at your option -remove any additional permissions from that copy, or from any part of -it. (Additional permissions may be written to require their own -removal in certain cases when you modify the work.) You may place -additional permissions on material, added by you to a covered work, -for which you have or can give appropriate copyright permission. 
- - Notwithstanding any other provision of this License, for material you -add to a covered work, you may (if authorized by the copyright holders of -that material) supplement the terms of this License with terms: - - a) Disclaiming warranty or limiting liability differently from the - terms of sections 15 and 16 of this License; or - - b) Requiring preservation of specified reasonable legal notices or - author attributions in that material or in the Appropriate Legal - Notices displayed by works containing it; or - - c) Prohibiting misrepresentation of the origin of that material, or - requiring that modified versions of such material be marked in - reasonable ways as different from the original version; or - - d) Limiting the use for publicity purposes of names of licensors or - authors of the material; or - - e) Declining to grant rights under trademark law for use of some - trade names, trademarks, or service marks; or - - f) Requiring indemnification of licensors and authors of that - material by anyone who conveys the material (or modified versions of - it) with contractual assumptions of liability to the recipient, for - any liability that these contractual assumptions directly impose on - those licensors and authors. - - All other non-permissive additional terms are considered "further -restrictions" within the meaning of section 10. If the Program as you -received it, or any part of it, contains a notice stating that it is -governed by this License along with a term that is a further -restriction, you may remove that term. If a license document contains -a further restriction but permits relicensing or conveying under this -License, you may add to a covered work material governed by the terms -of that license document, provided that the further restriction does -not survive such relicensing or conveying. - - If you add terms to a covered work in accord with this section, you -must place, in the relevant source files, a statement of the -additional terms that apply to those files, or a notice indicating -where to find the applicable terms. - - Additional terms, permissive or non-permissive, may be stated in the -form of a separately written license, or stated as exceptions; -the above requirements apply either way. - - 8. Termination. - - You may not propagate or modify a covered work except as expressly -provided under this License. Any attempt otherwise to propagate or -modify it is void, and will automatically terminate your rights under -this License (including any patent licenses granted under the third -paragraph of section 11). - - However, if you cease all violation of this License, then your -license from a particular copyright holder is reinstated (a) -provisionally, unless and until the copyright holder explicitly and -finally terminates your license, and (b) permanently, if the copyright -holder fails to notify you of the violation by some reasonable means -prior to 60 days after the cessation. - - Moreover, your license from a particular copyright holder is -reinstated permanently if the copyright holder notifies you of the -violation by some reasonable means, this is the first time you have -received notice of violation of this License (for any work) from that -copyright holder, and you cure the violation prior to 30 days after -your receipt of the notice. - - Termination of your rights under this section does not terminate the -licenses of parties who have received copies or rights from you under -this License. 
If your rights have been terminated and not permanently -reinstated, you do not qualify to receive new licenses for the same -material under section 10. - - 9. Acceptance Not Required for Having Copies. - - You are not required to accept this License in order to receive or -run a copy of the Program. Ancillary propagation of a covered work -occurring solely as a consequence of using peer-to-peer transmission -to receive a copy likewise does not require acceptance. However, -nothing other than this License grants you permission to propagate or -modify any covered work. These actions infringe copyright if you do -not accept this License. Therefore, by modifying or propagating a -covered work, you indicate your acceptance of this License to do so. - - 10. Automatic Licensing of Downstream Recipients. - - Each time you convey a covered work, the recipient automatically -receives a license from the original licensors, to run, modify and -propagate that work, subject to this License. You are not responsible -for enforcing compliance by third parties with this License. - - An "entity transaction" is a transaction transferring control of an -organization, or substantially all assets of one, or subdividing an -organization, or merging organizations. If propagation of a covered -work results from an entity transaction, each party to that -transaction who receives a copy of the work also receives whatever -licenses to the work the party's predecessor in interest had or could -give under the previous paragraph, plus a right to possession of the -Corresponding Source of the work from the predecessor in interest, if -the predecessor has it or can get it with reasonable efforts. - - You may not impose any further restrictions on the exercise of the -rights granted or affirmed under this License. For example, you may -not impose a license fee, royalty, or other charge for exercise of -rights granted under this License, and you may not initiate litigation -(including a cross-claim or counterclaim in a lawsuit) alleging that -any patent claim is infringed by making, using, selling, offering for -sale, or importing the Program or any portion of it. - - 11. Patents. - - A "contributor" is a copyright holder who authorizes use under this -License of the Program or a work on which the Program is based. The -work thus licensed is called the contributor's "contributor version". - - A contributor's "essential patent claims" are all patent claims -owned or controlled by the contributor, whether already acquired or -hereafter acquired, that would be infringed by some manner, permitted -by this License, of making, using, or selling its contributor version, -but do not include claims that would be infringed only as a -consequence of further modification of the contributor version. For -purposes of this definition, "control" includes the right to grant -patent sublicenses in a manner consistent with the requirements of -this License. - - Each contributor grants you a non-exclusive, worldwide, royalty-free -patent license under the contributor's essential patent claims, to -make, use, sell, offer for sale, import and otherwise run, modify and -propagate the contents of its contributor version. - - In the following three paragraphs, a "patent license" is any express -agreement or commitment, however denominated, not to enforce a patent -(such as an express permission to practice a patent or covenant not to -sue for patent infringement). 
To "grant" such a patent license to a -party means to make such an agreement or commitment not to enforce a -patent against the party. - - If you convey a covered work, knowingly relying on a patent license, -and the Corresponding Source of the work is not available for anyone -to copy, free of charge and under the terms of this License, through a -publicly available network server or other readily accessible means, -then you must either (1) cause the Corresponding Source to be so -available, or (2) arrange to deprive yourself of the benefit of the -patent license for this particular work, or (3) arrange, in a manner -consistent with the requirements of this License, to extend the patent -license to downstream recipients. "Knowingly relying" means you have -actual knowledge that, but for the patent license, your conveying the -covered work in a country, or your recipient's use of the covered work -in a country, would infringe one or more identifiable patents in that -country that you have reason to believe are valid. - - If, pursuant to or in connection with a single transaction or -arrangement, you convey, or propagate by procuring conveyance of, a -covered work, and grant a patent license to some of the parties -receiving the covered work authorizing them to use, propagate, modify -or convey a specific copy of the covered work, then the patent license -you grant is automatically extended to all recipients of the covered -work and works based on it. - - A patent license is "discriminatory" if it does not include within -the scope of its coverage, prohibits the exercise of, or is -conditioned on the non-exercise of one or more of the rights that are -specifically granted under this License. You may not convey a covered -work if you are a party to an arrangement with a third party that is -in the business of distributing software, under which you make payment -to the third party based on the extent of your activity of conveying -the work, and under which the third party grants, to any of the -parties who would receive the covered work from you, a discriminatory -patent license (a) in connection with copies of the covered work -conveyed by you (or copies made from those copies), or (b) primarily -for and in connection with specific products or compilations that -contain the covered work, unless you entered into that arrangement, -or that patent license was granted, prior to 28 March 2007. - - Nothing in this License shall be construed as excluding or limiting -any implied license or other defenses to infringement that may -otherwise be available to you under applicable patent law. - - 12. No Surrender of Others' Freedom. - - If conditions are imposed on you (whether by court order, agreement or -otherwise) that contradict the conditions of this License, they do not -excuse you from the conditions of this License. If you cannot convey a -covered work so as to satisfy simultaneously your obligations under this -License and any other pertinent obligations, then as a consequence you may -not convey it at all. For example, if you agree to terms that obligate you -to collect a royalty for further conveying from those to whom you convey -the Program, the only way you could satisfy both those terms and this -License would be to refrain entirely from conveying the Program. - - 13. Use with the GNU Affero General Public License. 
- - Notwithstanding any other provision of this License, you have -permission to link or combine any covered work with a work licensed -under version 3 of the GNU Affero General Public License into a single -combined work, and to convey the resulting work. The terms of this -License will continue to apply to the part which is the covered work, -but the special requirements of the GNU Affero General Public License, -section 13, concerning interaction through a network will apply to the -combination as such. - - 14. Revised Versions of this License. - - The Free Software Foundation may publish revised and/or new versions of -the GNU General Public License from time to time. Such new versions will -be similar in spirit to the present version, but may differ in detail to -address new problems or concerns. - - Each version is given a distinguishing version number. If the -Program specifies that a certain numbered version of the GNU General -Public License "or any later version" applies to it, you have the -option of following the terms and conditions either of that numbered -version or of any later version published by the Free Software -Foundation. If the Program does not specify a version number of the -GNU General Public License, you may choose any version ever published -by the Free Software Foundation. - - If the Program specifies that a proxy can decide which future -versions of the GNU General Public License can be used, that proxy's -public statement of acceptance of a version permanently authorizes you -to choose that version for the Program. - - Later license versions may give you additional or different -permissions. However, no additional obligations are imposed on any -author or copyright holder as a result of your choosing to follow a -later version. - - 15. Disclaimer of Warranty. - - THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY -APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT -HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY -OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, -THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR -PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM -IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF -ALL NECESSARY SERVICING, REPAIR OR CORRECTION. - - 16. Limitation of Liability. - - IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING -WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS -THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY -GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE -USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF -DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD -PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), -EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF -SUCH DAMAGES. - - 17. Interpretation of Sections 15 and 16. - - If the disclaimer of warranty and limitation of liability provided -above cannot be given local legal effect according to their terms, -reviewing courts shall apply local law that most closely approximates -an absolute waiver of all civil liability in connection with the -Program, unless a warranty or assumption of liability accompanies a -copy of the Program in return for a fee. 
- - END OF TERMS AND CONDITIONS - - How to Apply These Terms to Your New Programs - - If you develop a new program, and you want it to be of the greatest -possible use to the public, the best way to achieve this is to make it -free software which everyone can redistribute and change under these terms. - - To do so, attach the following notices to the program. It is safest -to attach them to the start of each source file to most effectively -state the exclusion of warranty; and each file should have at least -the "copyright" line and a pointer to where the full notice is found. - - <one line to give the program's name and a brief idea of what it does.> - Copyright (C) <year> <name of author> - - This program is free software: you can redistribute it and/or modify - it under the terms of the GNU General Public License as published by - the Free Software Foundation, either version 3 of the License, or - (at your option) any later version. - - This program is distributed in the hope that it will be useful, - but WITHOUT ANY WARRANTY; without even the implied warranty of - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - GNU General Public License for more details. - - You should have received a copy of the GNU General Public License - along with this program. If not, see <http://www.gnu.org/licenses/>. - -Also add information on how to contact you by electronic and paper mail. - - If the program does terminal interaction, make it output a short -notice like this when it starts in an interactive mode: - - <program> Copyright (C) <year> <name of author> - This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'. - This is free software, and you are welcome to redistribute it - under certain conditions; type `show c' for details. - -The hypothetical commands `show w' and `show c' should show the appropriate -parts of the General Public License. Of course, your program's commands -might be different; for a GUI interface, you would use an "about box". - - You should also get your employer (if you work as a programmer) or school, -if any, to sign a "copyright disclaimer" for the program, if necessary. -For more information on this, and how to apply and follow the GNU GPL, see -<http://www.gnu.org/licenses/>. - - The GNU General Public License does not permit incorporating your program -into proprietary programs. If your program is a subroutine library, you -may consider it more useful to permit linking proprietary applications with -the library. If this is what you want to do, use the GNU Lesser General -Public License instead of this License. But first, please read -<http://www.gnu.org/philosophy/why-not-lgpl.html>. diff --git a/PKG-INFO b/PKG-INFO index 3ed37ae..4dbe2ff 100644 --- a/PKG-INFO +++ b/PKG-INFO @@ -1,95 +1,99 @@ Metadata-Version: 2.1 Name: swh.loader.git -Version: 0.0.43 +Version: 0.0.48 Summary: Software Heritage git loader Home-page: https://forge.softwareheritage.org/diffusion/DLDG/ Author: Software Heritage developers Author-email: swh-devel@inria.fr License: UNKNOWN +Project-URL: Source, https://forge.softwareheritage.org/source/swh-loader-git Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest Project-URL: Funding, https://www.softwareheritage.org/donate -Project-URL: Source, https://forge.softwareheritage.org/source/swh-loader-git Description: swh-loader-git ============== The Software Heritage Git Loader is a tool and a library to walk a local Git repository and inject into the SWH dataset all contained files that weren't known before.
License ------- This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. See top-level LICENSE file for the full text of the GNU General Public License along with this program. Dependencies ------------ ### Runtime - python3 - python3-dulwich - python3-retrying - python3-swh.core - python3-swh.model - python3-swh.storage - python3-swh.scheduler ### Test - python3-nose Requirements ------------ - implementation language, Python3 - coding guidelines: conform to PEP8 - Git access: via dulwich Configuration ------------- - You can run the loader or the updater directly by calling: + You can run the loader from a remote origin (*loader*) or from an + origin on disk (*from_disk*) directly by calling: + + ``` - python3 -m swh.loader.git.{loader,updater} + python3 -m swh.loader.git.{loader,from_disk} ``` ### Location Both tools expect a configuration file. Either one of the following locations: - /etc/softwareheritage/ - ~/.config/swh/ - ~/.swh/ Note: we will call that location $SWH_CONFIG_PATH ### Configuration sample - $SWH_CONFIG_PATH/loader/git-{loader,updater}.yml: + Respectively, the loader from a remote (`git.yml`) and the loader from + disk (`git-disk.yml`), $SWH_CONFIG_PATH/loader/git{-disk}.yml: ``` storage: cls: remote args: url: http://localhost:5002/ ``` Platform: UNKNOWN Classifier: Programming Language :: Python :: 3 Classifier: Intended Audience :: Developers Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3) Classifier: Operating System :: OS Independent Classifier: Development Status :: 5 - Production/Stable Description-Content-Type: text/markdown Provides-Extra: testing diff --git a/README.md b/README.md index 5048f90..7ab46d5 100644 --- a/README.md +++ b/README.md @@ -1,75 +1,79 @@ swh-loader-git ============== The Software Heritage Git Loader is a tool and a library to walk a local Git repository and inject into the SWH dataset all contained files that weren't known before. License ------- This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. See top-level LICENSE file for the full text of the GNU General Public License along with this program.
Dependencies ------------ ### Runtime - python3 - python3-dulwich - python3-retrying - python3-swh.core - python3-swh.model - python3-swh.storage - python3-swh.scheduler ### Test - python3-nose Requirements ------------ - implementation language, Python3 - coding guidelines: conform to PEP8 - Git access: via dulwich Configuration ------------- -You can run the loader or the updater directly by calling: +You can run the loader from a remote origin (*loader*) or from an +origin on disk (*from_disk*) directly by calling: + + ``` -python3 -m swh.loader.git.{loader,updater} +python3 -m swh.loader.git.{loader,from_disk} ``` ### Location Both tools expect a configuration file. Either one of the following locations: - /etc/softwareheritage/ - ~/.config/swh/ - ~/.swh/ Note: we will call that location $SWH_CONFIG_PATH ### Configuration sample -$SWH_CONFIG_PATH/loader/git-{loader,updater}.yml: +Respectively, the loader from a remote (`git.yml`) and the loader from +disk (`git-disk.yml`), $SWH_CONFIG_PATH/loader/git{-disk}.yml: ``` storage: cls: remote args: url: http://localhost:5002/ ``` diff --git a/bin/dir-git-repo-meta.sh b/bin/dir-git-repo-meta.sh deleted file mode 100755 index 9c5617c..0000000 --- a/bin/dir-git-repo-meta.sh +++ /dev/null @@ -1,31 +0,0 @@ -#!/usr/bin/env bash - -# count the number of type (tree, blob, tag, commit) -REPO=${1-`pwd`} -TYPE=${2-"all"} - -data() { - git rev-list --objects --all \ - | git cat-file --batch-check='%(objectname) %(objecttype) %(rest)' \ - | cut -f2 -d' ' \ - | grep $1 \ - | wc -l -} - -cd $REPO - -if [ "$TYPE" = "all" ]; then - NB_BLOBS=$(data "blob") - NB_TREES=$(data "tree") - NB_COMMITS=$(data "commit") - NB_TAGS=$(data "tag") - cat < Fri, 25 Sep 2015 15:55:09 +0200 diff --git a/debian/compat b/debian/compat deleted file mode 100644 index ec63514..0000000 --- a/debian/compat +++ /dev/null @@ -1 +0,0 @@ -9 diff --git a/debian/control b/debian/control deleted file mode 100644 index 490735d..0000000 --- a/debian/control +++ /dev/null @@ -1,31 +0,0 @@ -Source: swh-loader-git -Maintainer: Software Heritage developers -Section: python -Priority: optional -Build-Depends: debhelper (>= 9), - dh-python (>= 2), - python3-all, - python3-click, - python3-dulwich (>= 0.18.7~), - python3-nose, - python3-retrying, - python3-setuptools, - python3-swh.core (>= 0.0.7~), - python3-swh.loader.core (>= 0.0.32), - python3-swh.model (>= 0.0.27~), - python3-swh.scheduler (>= 0.0.14~), - python3-swh.storage (>= 0.0.108~), - python3-vcversioner -Standards-Version: 3.9.6 -Homepage: https://forge.softwareheritage.org/diffusion/DLDG/ - -Package: python3-swh.loader.git -Architecture: all -Depends: python3-swh.core (>= 0.0.7~), - python3-swh.loader.core (>= 0.0.32~), - python3-swh.model (>= 0.0.27~), - python3-swh.scheduler (>= 0.0.14~), - python3-swh.storage (>= 0.0.108~), - ${misc:Depends}, - ${python3:Depends} -Description: Software Heritage Git loader diff --git a/debian/copyright b/debian/copyright deleted file mode 100644 index 81d037d..0000000 --- a/debian/copyright +++ /dev/null @@ -1,22 +0,0 @@ -Format: http://www.debian.org/doc/packaging-manuals/copyright-format/1.0/ - -Files: * -Copyright: 2015 The Software Heritage developers -License: GPL-3+ - -License: GPL-3+ - This program is free software: you can redistribute it and/or modify - it under the terms of the GNU General Public License as published by - the Free Software Foundation; either version 3 of the License, or - (at your option) any later version. - .
- This program is distributed in the hope that it will be useful, - but WITHOUT ANY WARRANTY; without even the implied warranty of - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - GNU General Public License for more details. - . - You should have received a copy of the GNU General Public License - along with this program. If not, see <http://www.gnu.org/licenses/>. - . - On Debian systems, the complete text of the GNU General Public - License version 3 can be found in `/usr/share/common-licenses/GPL-3'. diff --git a/debian/rules b/debian/rules deleted file mode 100755 index 7803287..0000000 --- a/debian/rules +++ /dev/null @@ -1,12 +0,0 @@ -#!/usr/bin/make -f - -export PYBUILD_NAME=swh.loader.git -export PYBUILD_TEST_ARGS=--with-doctest -sv -a !db,!fs - -%: - dh $@ --with python3 --buildsystem=pybuild - -override_dh_install: - dh_install - rm -v $(CURDIR)/debian/python3-*/usr/lib/python*/dist-packages/swh/__init__.py - rm -v $(CURDIR)/debian/python3-*/usr/lib/python*/dist-packages/swh/loader/__init__.py diff --git a/debian/source/format b/debian/source/format deleted file mode 100644 index 163aaf8..0000000 --- a/debian/source/format +++ /dev/null @@ -1 +0,0 @@ -3.0 (quilt) diff --git a/docs/.gitignore b/docs/.gitignore deleted file mode 100644 index 58a761e..0000000 --- a/docs/.gitignore +++ /dev/null @@ -1,3 +0,0 @@ -_build/ -apidoc/ -*-stamp diff --git a/docs/Makefile b/docs/Makefile deleted file mode 100644 index c30c50a..0000000 --- a/docs/Makefile +++ /dev/null @@ -1 +0,0 @@ -include ../../swh-docs/Makefile.sphinx diff --git a/docs/_static/.placeholder b/docs/_static/.placeholder deleted file mode 100644 index e69de29..0000000 diff --git a/docs/_templates/.placeholder b/docs/_templates/.placeholder deleted file mode 100644 index e69de29..0000000 diff --git a/docs/attic/api-backend-protocol.txt b/docs/attic/api-backend-protocol.txt deleted file mode 100644 index cb5bc49..0000000 --- a/docs/attic/api-backend-protocol.txt +++ /dev/null @@ -1,195 +0,0 @@ -Design considerations -===================== - -# Goal - -Load the representation of a git, svn, cvs, tarball, et al. repository in -software heritage's backend. - -# Nomenclature - -cf. swh-sql/swh.sql comments --> FIXME: find a means to compute docs from sql - -From this point on, `signatures` means: -- the git sha1, plus the sha1 and sha256 of the object's content, for objects -of type content -- the git sha1s for all other object types (directories, contents, revisions, -occurrences, releases) - -A worker is one instance running swh-loader-git to parse and load a repository -in the backend. It is not distributed. - -The backend api talks to one or many workers. -It is distributed. - -# Scenario - -In the following, we will describe with different granularities what will -happen between 1 worker and the backend api. - -## 1 - -A worker parses a repository. -It sends the parsing result to the backend in multiple requests/responses. -The worker sends the list of sha1s (git sha1s) encountered. -The server responds with the list of unknown sha1s. -The worker sends those sha1s and their associated data to the server. -The server stores what it receives. - -## 2 - -01. Worker parses local repository and builds a memory model of it. - -02. HAVE: Worker sends repository's contents signatures to the backend for it -to filter what it knows. -03. WANT: Backend replies with unknown contents sha1s. -04. SAVE: Worker sends all `content` data through 1 (or more) request(s). -05. SAVED: Backend stores them and finishes the transaction(s). - -06.
HAVE: Worker sends repository's directories' signatures to the backend for -it to filter. -07. WANT: Backend replies with unknown directory sha1s. -08. SAVE: Worker sends all `directory`s' data through 1 (or more) request(s). -09. SAVED: Backend stores them and finishes the transaction(s). - -10. HAVE: Worker sends repository's revisions' signatures to the backend. -11. WANT: Backend replies with unknown revisions' sha1s. -12. SAVE: Worker sends the `revision`s' data through 1 (or more) request(s). -13. SAVED: Backend stores them and finishes the transaction(s). - -14. SAVE: Worker sends repository's occurrences for the backend to save what it -does not know yet. -15. SAVE: Worker sends repository's releases for the backend to save what it -does not know yet. -16. Worker is done. - -## 3 - -01. Worker parses repository and builds a data memory model. -The data memory model has the following structure for each possible type: -- signatures list -- map indexed by git sha1, object representation. -The type of object (content, directory, revision, release, occurrence) is kept. - -02. Worker sends the sha1s using the api backend's protocol. - -03. Api Backend receives the list of sha1s, filters out the known sha1s and -replies to the worker with the unknown ones. - -04. Worker receives the list of unknown sha1s. -The worker builds the list of unknown `content`s. - -A list of contents, for each content: -- git's sha1 (when parsing git repository) -- sha1 content (as per content's sha1) -- sha256 content -- content's size -- content - -And sends it to the api's backend. - -05. Backend receives the data and: -- computes from the `content` the signatures (sha1, sha256). FIXME: Not implemented yet -- checks the signatures match the client's data FIXME: Not implemented yet -- Stores the content on the file storage -- Persists in the db the received data -If any error is detected during the process (checksums do not match, write -errors, ...), the db transaction is rolled back and a failure is sent to the -client. -Otherwise, the db transaction is committed and a success is sent back to the -client. - -*Note* Optimization possible: slice in multiple queries. - -06. Worker receives the result from the api. -If failure, worker stops. The task is done. -Otherwise, the worker continues by sending the list of `directory` structures. - -A list of directories, for each directory: -- sha1 -- directory's content -- list of directory entries: - - name : relative path to parent entry or root - - sha1 : pointer to the object this directory entry points to - - type : whether entry is a file or a dir - - perms : unix-like permissions - - atime : time of last access FIXME: Not the right time yet - - mtime : time of last modification FIXME: Not the right time yet - - ctime : time of last status change FIXME: Not the right time yet - - directory: parent directory sha1 - -And sends it to the api's backend. - -*Note* Optimization possible: slice in multiple queries. - -07. Api backend receives the data. -Persists the directory's content on the file storage. -Persists the directories and directory entries on the db's side with respect to -the previously stored directories and contents. - -If any error is raised, the transaction is rolled back and an error is sent back -to the client (worker). -Otherwise, the transaction is committed and the success is sent back to the -client. - -08. Worker receives the result from the api. -If failure, worker stops. The task is done. -Otherwise, the worker continues by building the list of unknown `revision`s.
- -A list of revisions, for each revision: -- sha1, the revision's sha1 -- revision's parent sha1s, the list of revision parents -- content, the revision's content -- revision's date -- directory id the revision points to -- message, the revision's message -- author -- committer - -And sends it to the api's backend. - -*Note* Optimization possible: slice in multiple queries. - -09. Api backend receives data. -Persists the revisions' content on the file storage. -Persists the revisions on the db's side with respect to the previously stored -directories and contents. - -If any error is raised, the transaction is rolled back and an error is sent back -to the client (worker). -Otherwise, the transaction is committed and the success is sent back to the -client. - -10. Worker receives the result. Worker sends the complete occurrences list. - -A list of occurrences, for each occurrence: -- sha1, the sha1 the occurrence points to -- reference, the occurrence's name -- url-origin, the origin of the repository - - -11. The backend receives the list of occurrences and persists only what it does -not know. Acks the result to the worker. - -12. Worker sends the complete releases list. - -A list of releases, for each release: -- sha1, the release sha1 -- content, the content of the appointed commit -- revision, the sha1 the release points to -- name, the release's name -- date, the release's date # FIXME: find the tag's date, -- author, the release's author information -- comment, the release's message - -13. The backend receives the list of releases and persists only what it does -not know. Acks the result to the worker. - -14. Worker receives the result and stops either way. The task is done. - -## Protocol details - -- worker serializes the content's payload (a python data structure) in pickle -format -- backend deserializes the request's payload into a python data structure
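The HAVE/WANT/SAVE exchange described in this deleted protocol note is compact enough to sketch. Below is a minimal, hypothetical Python rendering of the worker side: the `post` transport and the `/filter` and `/save` endpoint names are invented for illustration, and only the pickle serialization is taken from the protocol details above.

```python
import pickle

def have_want_save(objects, post):
    """One HAVE/WANT/SAVE round for a batch of objects of one type.

    objects: dict mapping git sha1 -> object payload.
    post(path, body): assumed transport callable that sends `body` to
        the backend api at `path` and returns the raw response bytes.
    """
    # HAVE: send every signature parsed from the repository.
    reply = post("/filter", pickle.dumps(sorted(objects)))
    # WANT: the backend answers with the sha1s it does not know yet.
    unknown = pickle.loads(reply)
    # SAVE: send payloads for unknown objects only, sliced into several
    # requests (the "slice in multiple queries" optimization above).
    for batch in slices([(sha1, objects[sha1]) for sha1 in unknown]):
        post("/save", pickle.dumps(dict(batch)))

def slices(items, size=1000):
    """Yield successive size-sized chunks of a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]
```

The same round trip is repeated per object type, in dependency order (contents, then directories, then revisions), exactly as steps 02 through 13 walk through it.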
diff --git a/docs/attic/git-loading-design.txt b/docs/attic/git-loading-design.txt deleted file mode 100644 index dfcaee4..0000000 --- a/docs/attic/git-loading-design.txt +++ /dev/null @@ -1,136 +0,0 @@ -Design considerations -===================== - -* **Caching**: our storage contains two main parts: a file storage, and a git - object storage. Both parts are accessible as key-value storage. Whenever - possible we want to avoid checking in content that is provably already in - there. - -* **Concurrency**: our storage will be accessed concurrently for both read and - write purposes. In particular for writing, it is possible that multiple - workers will be in the process, at the same time, of loading into our storage - different git repositories that have a lot of overlap, if not completely - identical. Whenever possible they should be able to benefit from each other's - work, collaborating *de facto*. - -* **Robustness**: workers that load content into our storage might crash before - completion. Wherever possible, the work done before completion should be - preserved by the storage. Eventually another worker (possibly the same as - before) will pick up the same git repository, try it again, and drive it to - completion. - -* **Ctime and Atimes**: for every piece of content we are interested in both - creation time ("ctime", i.e., the first time we have seen a given content and - added it to our storage) and access times ("atime", i.e., every time we see - the same content *elsewhere* we want to be able to store the fact we have - seen it *again*, and again, and again...). - -* **Transactionality**: every content addition should be transactional: only - after having stored the content in full will we tell the world (and other - workers) that the content is available. (Without locking, which is desirable) - This might result in temporary races where multiple workers try to add the - same content without knowing of each other---this situation should be handled - gracefully. - - Transactionality should apply across different storage media: in particular - the filesystem used to store file content and the DB used to store the - corresponding metadata should cooperate. It is OK for the filesystem to have - content that is not indexed in the DB; but for all purposes that should be - equivalent to not having stored the content *at all*. - - -Git traversal -============= - -To load the whole content of a git repo in our storage we need to traverse the -git object graph, and inject every single version of every single file we find. -We discuss below how we should traverse the git graph to that end. - -For the sake of conciseness we do not distinguish git object types. The actual -code does need to treat different kinds of git objects differently though (and -in particular commits -> trees -> blobs); see the implementation for -details about this. - - -Top-down -------- - -* Top-down, topological (latest first) traversal of the git object graph - starting from the current refs is optimal from the point of view of caching. - Once a given object is found in the cache we know that we have already loaded - it in the storage and we do not need to treat its parents any further. - -* Top-down however is not good for robustness. If we store the current node - before its parents and the loading fails to complete, in the future we - will believe we have stored all its parents whereas we have not. - FAIL. - -Conclusion: pure top-down traversal is bad for us. - - -Bottom-up --------- - -* Bottom-up, topological traversal is good for robustness. Once we reach the - top we know we have stored all its parents, so in the future we can - benefit from caching. - -* However, bottom-up is bad for caching. If we always treat parents before - descendants, we will benefit from caching only at the level of individual - objects, and never at the level of whole subgraphs. - -Conclusion: pure bottom-up traversal is OK, but does not allow to benefit from -subgraph caching. - - -Mixed top-down/bottom-up ------------------------- - -To get the best of both worlds we need a mixed approach, something like -(pseudocode): - - let rec load_git_graph node = - if not (node in storage) then - for parent in parents(node) - load_git_graph(parent) - add_to_storage(node, storage) - -Note: non tail-recursive. - -Conclusion: the above offers both robustness w.r.t. loading crashes and -subgraph caching. - - -Atime maintenance ----------------- - -Bad news: with the mixed approach it's easy to maintain ctimes, but atimes -cannot be maintained (because we do not visit cached subgraphs at all). More -generally: subgraph caching or atime maintenance <- choose one. - -If we do want to maintain atimes (at this level---as opposed to, say, doing -that separately from git repo loading) we need to give up on subgraph caching. -If we do that, top-down vs bottom-up doesn't really matter.
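The mixed traversal pseudocode above translates almost directly into Python. The sketch below is a hypothetical, iterative rendering (sidestepping the non-tail-recursion the note warns about); `storage` and `parents` are placeholder interfaces, not the loader's actual API.

```python
def load_git_graph(root, storage, parents):
    """Mixed top-down/bottom-up load of a git object graph.

    storage: set-like object supporting `in` and `.add()` (placeholder).
    parents: callable returning the parent nodes of a node (placeholder).
    """
    stack = [(root, False)]
    while stack:
        node, ancestors_done = stack.pop()
        if node in storage:
            # Subgraph caching: a known node prunes its whole ancestry.
            continue
        if ancestors_done:
            # Bottom-up store: all parents are in storage by now.
            storage.add(node)
        else:
            # Top-down descent: revisit this node after its parents.
            stack.append((node, True))
            stack.extend((p, False) for p in parents(node))
```

Because a node is stored only after its complete ancestry, a crash at any point leaves the storage consistent, and any node already present safely prunes its subgraph; these are exactly the robustness and caching properties the two conclusions above argue for.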
-
-
-Atime maintenance
------------------
-
-Bad news: with the mixed approach it is easy to maintain ctimes, but atimes
-cannot be maintained (because we do not visit already-stored subgraphs at
-all). More generally, we can have subgraph caching or atime maintenance:
-choose one.
-
-If we do want to maintain atimes (at this level---as opposed to, say, doing
-that separately from git repo loading) we need to give up on subgraph
-caching. If we do that, top-down vs bottom-up doesn't really matter.
-
-
-Cross file system + DB transactions
-===================================
-
-To ensure cross file system / DB transactionality, we proceed as follows to
-add a single file to our storage, where KEY is the key of the file to be
-added, and WORKER_ID the unique identifier of the worker that is updating the
-storage:
-
-1. BEGIN TRANSACTION
-2. create file KEY.WORKER_ID, overwriting destination if needed
-3. write file content to KEY.WORKER_ID
-4. rename(KEY.WORKER_ID, KEY), overwriting destination if needed
-5. INSERT KEY INTO ...
-6. COMMIT
-
-Any error in the above causes a transaction ABORT.
-
-Failure scenarios (that should all be handled properly by the above
-protocol):
-
-* worker crash, at any moment during the above
-* parallel execution, resulting in one worker failing due to key duplication
-  upon step (5) or (6)
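
A minimal sketch of the six-step protocol, using sqlite3 in place of the real database and a made-up table name (the real storage talks to a PostgreSQL db_url, per the configuration samples below):

```python
import os
import sqlite3
import tempfile

def add_file(storage_dir, key, data, worker_id, db):
    # Sketch of the six steps above; `db` is any DB-API connection whose
    # context manager wraps a transaction.
    tmp_path = os.path.join(storage_dir, '%s.%s' % (key, worker_id))
    with db:  # 1. BEGIN TRANSACTION ... 6. COMMIT (ABORT on any error)
        with open(tmp_path, 'wb') as f:  # 2. create KEY.WORKER_ID
            f.write(data)                # 3. write file content
        # 4. rename(KEY.WORKER_ID, KEY): atomic on POSIX, overwrites KEY
        os.rename(tmp_path, os.path.join(storage_dir, key))
        # 5. INSERT KEY INTO ...; a duplicate key raises here, aborting
        db.execute('INSERT INTO objects (key) VALUES (?)', (key,))

# Toy run against an in-memory DB standing in for the real one.
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE objects (key TEXT PRIMARY KEY)')
add_file(tempfile.mkdtemp(), 'deadbeef', b'some content', 'worker-1', db)
```

Note that an ABORT leaves the file on disk but unindexed, which the transactionality requirement above explicitly allows: unindexed content is equivalent to content not stored at all.
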
diff --git a/docs/conf.py b/docs/conf.py
deleted file mode 100644
index 190deb7..0000000
--- a/docs/conf.py
+++ /dev/null
@@ -1 +0,0 @@
-from swh.docs.sphinx.conf import *  # NoQA
diff --git a/docs/index.rst b/docs/index.rst
deleted file mode 100644
index 4b1ed20..0000000
--- a/docs/index.rst
+++ /dev/null
@@ -1,19 +0,0 @@
-.. _swh-loader-git:
-
-Software Heritage - Git loader
-==============================
-
-Loader for `Git `_ repositories.
-
-
-.. toctree::
-   :maxdepth: 2
-   :caption: Contents:
-
-
-Indices and tables
-==================
-
-* :ref:`genindex`
-* :ref:`modindex`
-* :ref:`search`
diff --git a/requirements-swh.txt b/requirements-swh.txt
index c62d405..9cfd3cf 100644
--- a/requirements-swh.txt
+++ b/requirements-swh.txt
@@ -1,5 +1,5 @@
 swh.core >= 0.0.7
-swh.loader.core >= 0.0.32
+swh.loader.core >= 0.0.37
 swh.model >= 0.0.27
-swh.scheduler >= 0.0.14
+swh.scheduler >= 0.0.39
 swh.storage >= 0.0.108
diff --git a/requirements-test.txt b/requirements-test.txt
deleted file mode 100644
index f3c7e8e..0000000
--- a/requirements-test.txt
+++ /dev/null
@@ -1 +0,0 @@
-nose
diff --git a/resources/local-loader-git.ini b/resources/local-loader-git.ini
deleted file mode 100644
index da492be..0000000
--- a/resources/local-loader-git.ini
+++ /dev/null
@@ -1,10 +0,0 @@
-[main]
-# Where to store the logs
-log_dir = /tmp/swh-loader-git/log
-
-# how to access the backend (remote or local)
-backend-type = local
-
-# backend-type remote: url access to api rest's backend
-# backend-type local: configuration file to backend file .ini (cf. back.ini file)
-backend = ~/.config/swh/back.ini
diff --git a/resources/remote-loader-git.ini b/resources/remote-loader-git.ini
deleted file mode 100644
index 223e9c1..0000000
--- a/resources/remote-loader-git.ini
+++ /dev/null
@@ -1,10 +0,0 @@
-[main]
-# Where to store the logs
-log_dir = /tmp/swh-loader-git/log
-
-# how to access the backend (remote or local)
-backend-type = remote
-
-# backend-type remote: url access to api rest's backend
-# backend-type local: configuration file to backend file .ini (cf. back.ini file)
-backend = http://localhost:5000
diff --git a/resources/test/back.ini b/resources/test/back.ini
deleted file mode 100644
index 927957e..0000000
--- a/resources/test/back.ini
+++ /dev/null
@@ -1,22 +0,0 @@
-[main]
-
-# where to store blob on disk
-content_storage_dir = /tmp/swh-loader-git/test/content-storage
-
-# Where to store the logs
-log_dir = /tmp/swh-loader-git/test/log
-
-# url access to db: dbname= (host= port= user= password=)
-db_url = dbname=softwareheritage-dev-test
-
-# compute folder's depth on disk aa/bb/cc/dd
-#folder_depth = 4
-
-# To open to the world, 0.0.0.0
-#host = 127.0.0.1
-
-# Debugger (for dev only)
-debug = true
-
-# server port to listen to requests
-port = 5001
diff --git a/resources/test/db-manager.ini b/resources/test/db-manager.ini
deleted file mode 100644
index 679bc8c..0000000
--- a/resources/test/db-manager.ini
+++ /dev/null
@@ -1,7 +0,0 @@
-[main]
-
-# Where to store the logs
-log_dir = swh-loader-git/log
-
-# url access to db
-db_url = dbname=softwareheritage-dev-test
diff --git a/resources/updater.ini b/resources/updater.ini
deleted file mode 100644
index da492be..0000000
--- a/resources/updater.ini
+++ /dev/null
@@ -1,10 +0,0 @@
-[main]
-# Where to store the logs
-log_dir = /tmp/swh-loader-git/log
-
-# how to access the backend (remote or local)
-backend-type = local
-
-# backend-type remote: url access to api rest's backend
-# backend-type local: configuration file to backend file .ini (cf. back.ini file)
-backend = ~/.config/swh/back.ini
diff --git a/swh.loader.git.egg-info/PKG-INFO b/swh.loader.git.egg-info/PKG-INFO
index 3ed37ae..4dbe2ff 100644
--- a/swh.loader.git.egg-info/PKG-INFO
+++ b/swh.loader.git.egg-info/PKG-INFO
@@ -1,95 +1,99 @@
 Metadata-Version: 2.1
 Name: swh.loader.git
-Version: 0.0.43
+Version: 0.0.48
 Summary: Software Heritage git loader
 Home-page: https://forge.softwareheritage.org/diffusion/DLDG/
 Author: Software Heritage developers
 Author-email: swh-devel@inria.fr
 License: UNKNOWN
+Project-URL: Source, https://forge.softwareheritage.org/source/swh-loader-git
 Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest
 Project-URL: Funding, https://www.softwareheritage.org/donate
-Project-URL: Source, https://forge.softwareheritage.org/source/swh-loader-git
 Description: swh-loader-git
         ==============
        
         The Software Heritage Git Loader is a tool and a library to walk a local
         Git repository and inject into the SWH dataset all contained files that
         weren't known before.
       
         License
         -------
       
         This program is free software: you can redistribute it and/or modify it
         under the terms of the GNU General Public License as published by the
         Free Software Foundation, either version 3 of the License, or (at your
         option) any later version.
       
         This program is distributed in the hope that it will be useful, but
         WITHOUT ANY WARRANTY; without even the implied warranty of
         MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
         General Public License for more details.
       
         See top-level LICENSE file for the full text of the GNU General Public
         License along with this program.
       
         Dependencies
         ------------
       
         ### Runtime
       
         - python3
         - python3-dulwich
         - python3-retrying
         - python3-swh.core
         - python3-swh.model
         - python3-swh.storage
         - python3-swh.scheduler
       
         ### Test
       
         - python3-nose
       
         Requirements
         ------------
       
         - implementation language, Python3
         - coding guidelines: conform to PEP8
         - Git access: via dulwich
       
         Configuration
         -------------
       
-        You can run the loader or the updater directly by calling:
+        You can run the loader from a remote origin (*loader*) or from an
+        origin on disk (*from_disk*) directly by calling:
+
+        ```
-        python3 -m swh.loader.git.{loader,updater}
+        python3 -m swh.loader.git.{loader,from_disk}
         ```
       
         ### Location
       
         Both tools expect a configuration file, in one of the following
         locations:
       
         - /etc/softwareheritage/
         - ~/.config/swh/
         - ~/.swh/
       
         Note: we will refer to that location as $SWH_CONFIG_PATH
       
         ### Configuration sample
       
-        $SWH_CONFIG_PATH/loader/git-{loader,updater}.yml:
+        Respectively the loader from a remote (`git.yml`) and the loader from
+        a disk (`git-disk.yml`), $SWH_CONFIG_PATH/loader/git{-disk}.yml:
       
         ```
         storage:
           cls: remote
           args:
             url: http://localhost:5002/
         ```
       
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 5 - Production/Stable
Description-Content-Type: text/markdown
Provides-Extra: testing
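
The two entry points described above can also be driven from Python, mirroring the `__main__` blocks in the diffs below; a sketch that assumes the storage configuration sample above is in place (the URL and path are placeholders):

```python
import datetime

from swh.loader.git.loader import GitLoader
from swh.loader.git.from_disk import GitLoaderFromDisk

# Remote origin, fetched over the git wire protocol.
GitLoader().load('https://example.org/some/repo.git')

# Origin already available as a clone on disk.
GitLoaderFromDisk().load(
    'https://example.org/some/repo.git',
    '/srv/checkouts/repo',
    datetime.datetime.now(tz=datetime.timezone.utc),
)
```
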
diff --git a/swh.loader.git.egg-info/SOURCES.txt b/swh.loader.git.egg-info/SOURCES.txt
index 1153563..8d37c5c 100644
--- a/swh.loader.git.egg-info/SOURCES.txt
+++ b/swh.loader.git.egg-info/SOURCES.txt
@@ -1,49 +1,28 @@
-.gitignore
-AUTHORS
-LICENSE
 MANIFEST.in
 Makefile
 README.md
 requirements-swh.txt
-requirements-test.txt
 requirements.txt
 setup.py
 version.txt
-bin/dir-git-repo-meta.sh
-debian/changelog
-debian/compat
-debian/control
-debian/copyright
-debian/rules
-debian/source/format
-docs/.gitignore
-docs/Makefile
-docs/conf.py
-docs/index.rst
-docs/_static/.placeholder
-docs/_templates/.placeholder
-docs/attic/api-backend-protocol.txt
-docs/attic/git-loading-design.txt
-resources/local-loader-git.ini
-resources/remote-loader-git.ini
-resources/updater.ini
-resources/test/back.ini
-resources/test/db-manager.ini
 swh/__init__.py
 swh.loader.git.egg-info/PKG-INFO
 swh.loader.git.egg-info/SOURCES.txt
 swh.loader.git.egg-info/dependency_links.txt
 swh.loader.git.egg-info/requires.txt
 swh.loader.git.egg-info/top_level.txt
 swh/loader/__init__.py
 swh/loader/git/__init__.py
 swh/loader/git/converters.py
+swh/loader/git/from_disk.py
 swh/loader/git/loader.py
-swh/loader/git/reader.py
 swh/loader/git/tasks.py
-swh/loader/git/updater.py
 swh/loader/git/utils.py
 swh/loader/git/tests/__init__.py
+swh/loader/git/tests/conftest.py
 swh/loader/git/tests/test_converters.py
+swh/loader/git/tests/test_from_disk.py
+swh/loader/git/tests/test_loader.py
+swh/loader/git/tests/test_tasks.py
 swh/loader/git/tests/test_utils.py
 swh/loader/git/tests/data/git-repos/example-submodule.fast-export.xz
\ No newline at end of file
diff --git a/swh.loader.git.egg-info/requires.txt b/swh.loader.git.egg-info/requires.txt
index 821f385..ed968b5 100644
--- a/swh.loader.git.egg-info/requires.txt
+++ b/swh.loader.git.egg-info/requires.txt
@@ -1,12 +1,13 @@
-click
 dulwich>=0.18.7
 retrying
+vcversioner
+click
 swh.core>=0.0.7
-swh.loader.core>=0.0.32
+swh.loader.core>=0.0.37
 swh.model>=0.0.27
-swh.scheduler>=0.0.14
+swh.scheduler>=0.0.39
 swh.storage>=0.0.108
-vcversioner
 
 [testing]
-nose
+pytest<4
+swh.scheduler[testing] diff --git a/swh/loader/git/loader.py b/swh/loader/git/from_disk.py similarity index 83% copy from swh/loader/git/loader.py copy to swh/loader/git/from_disk.py index 51dd375..b7bbdb3 100644 --- a/swh/loader/git/loader.py +++ b/swh/loader/git/from_disk.py @@ -1,307 +1,361 @@ # Copyright (C) 2015-2018 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import datetime import dulwich.repo import os import shutil from dulwich.errors import ObjectFormatException, EmptyFileException from collections import defaultdict from swh.model import hashutil -from swh.loader.core.loader import SWHStatelessLoader +from swh.loader.core.loader import UnbufferedLoader from . import converters, utils -class GitLoader(SWHStatelessLoader): +class GitLoaderFromDisk(UnbufferedLoader): """Load a git repository from a directory. + """ - CONFIG_BASE_FILENAME = 'loader/git-loader' + CONFIG_BASE_FILENAME = 'loader/git-disk' def __init__(self, config=None): super().__init__(logging_class='swh.loader.git.Loader', config=config) - def prepare_origin_visit(self, origin_url, directory, visit_date): + def _prepare_origin_visit(self, origin_url, visit_date): self.origin_url = origin_url self.origin = converters.origin_url_to_origin(self.origin_url) self.visit_date = visit_date + def prepare_origin_visit(self, origin_url, directory, visit_date): + self._prepare_origin_visit(origin_url, visit_date) + def prepare(self, origin_url, directory, visit_date): self.repo = dulwich.repo.Repo(directory) def iter_objects(self): object_store = self.repo.object_store for pack in object_store.packs: objs = list(pack.index.iterentries()) objs.sort(key=lambda x: x[1]) for sha, offset, crc32 in objs: yield hashutil.hash_to_bytehex(sha) yield from object_store._iter_loose_objects() yield from object_store._iter_alternate_objects() def _check(self, obj): """Check the object's repository representation. If any errors in check exists, an ObjectFormatException is raised. Args: obj (object): Dulwich object read from the repository. """ obj.check() from dulwich.objects import Commit, Tag try: # For additional checks on dulwich objects with date # for now, only checks on *time if isinstance(obj, Commit): commit_time = obj._commit_time utils.check_date_time(commit_time) author_time = obj._author_time utils.check_date_time(author_time) elif isinstance(obj, Tag): tag_time = obj._tag_time utils.check_date_time(tag_time) except Exception as e: raise ObjectFormatException(e) def get_object(self, oid): """Given an object id, return the object if it is found and not malformed in some way. 
Args: oid (bytes): the object's identifier Returns: The object if found without malformation """ try: # some errors are raised when reading the object obj = self.repo[oid] # some we need to check ourselves self._check(obj) except KeyError: _id = oid.decode('utf-8') self.log.warn('object %s not found, skipping' % _id, extra={ 'swh_type': 'swh_loader_git_missing_object', 'swh_object_id': _id, 'origin_id': self.origin_id, }) return None except ObjectFormatException: _id = oid.decode('utf-8') self.log.warn('object %s malformed, skipping' % _id, extra={ 'swh_type': 'swh_loader_git_missing_object', 'swh_object_id': _id, 'origin_id': self.origin_id, }) return None except EmptyFileException: _id = oid.decode('utf-8') self.log.warn('object %s corrupted (empty file), skipping' % _id, extra={ 'swh_type': 'swh_loader_git_missing_object', 'swh_object_id': _id, 'origin_id': self.origin_id, }) else: return obj def fetch_data(self): """Fetch the data from the data source""" self.previous_snapshot = self.storage.snapshot_get_latest( self.origin_id ) type_to_ids = defaultdict(list) for oid in self.iter_objects(): obj = self.get_object(oid) if not obj: continue type_name = obj.type_name type_to_ids[type_name].append(oid) self.type_to_ids = type_to_ids def has_contents(self): """Checks whether we need to load contents""" return bool(self.type_to_ids[b'blob']) def get_content_ids(self): """Get the content identifiers from the git repository""" for oid in self.type_to_ids[b'blob']: yield converters.dulwich_blob_to_content_id(self.repo[oid]) def get_contents(self): """Get the contents that need to be loaded""" max_content_size = self.config['content_size_limit'] missing_contents = set(self.storage.content_missing( self.get_content_ids(), 'sha1_git')) for oid in missing_contents: yield converters.dulwich_blob_to_content( self.repo[hashutil.hash_to_bytehex(oid)], log=self.log, max_content_size=max_content_size, origin_id=self.origin_id) def has_directories(self): """Checks whether we need to load directories""" return bool(self.type_to_ids[b'tree']) def get_directory_ids(self): """Get the directory identifiers from the git repository""" return (hashutil.hash_to_bytes(id.decode()) for id in self.type_to_ids[b'tree']) def get_directories(self): """Get the directories that need to be loaded""" missing_dirs = set(self.storage.directory_missing( sorted(self.get_directory_ids()))) for oid in missing_dirs: yield converters.dulwich_tree_to_directory( self.repo[hashutil.hash_to_bytehex(oid)], log=self.log) def has_revisions(self): """Checks whether we need to load revisions""" return bool(self.type_to_ids[b'commit']) def get_revision_ids(self): """Get the revision identifiers from the git repository""" return (hashutil.hash_to_bytes(id.decode()) for id in self.type_to_ids[b'commit']) def get_revisions(self): """Get the revisions that need to be loaded""" missing_revs = set(self.storage.revision_missing( sorted(self.get_revision_ids()))) for oid in missing_revs: yield converters.dulwich_commit_to_revision( self.repo[hashutil.hash_to_bytehex(oid)], log=self.log) def has_releases(self): """Checks whether we need to load releases""" return bool(self.type_to_ids[b'tag']) def get_release_ids(self): """Get the release identifiers from the git repository""" return (hashutil.hash_to_bytes(id.decode()) for id in self.type_to_ids[b'tag']) def get_releases(self): """Get the releases that need to be loaded""" missing_rels = set(self.storage.release_missing( sorted(self.get_release_ids()))) for oid in missing_rels: yield 
converters.dulwich_tag_to_release( self.repo[hashutil.hash_to_bytehex(oid)], log=self.log) def get_snapshot(self): """Turn the list of branches into a snapshot to load""" branches = {} for ref, target in self.repo.refs.as_dict().items(): obj = self.get_object(target) if obj: branches[ref] = { 'target': hashutil.bytehex_to_hash(target), 'target_type': converters.DULWICH_TYPES[obj.type_name], } else: branches[ref] = None self.snapshot = converters.branches_to_snapshot(branches) return self.snapshot def get_fetch_history_result(self): """Return the data to store in fetch_history for the current loader""" return { 'contents': len(self.type_to_ids[b'blob']), 'directories': len(self.type_to_ids[b'tree']), 'revisions': len(self.type_to_ids[b'commit']), 'releases': len(self.type_to_ids[b'tag']), } def save_data(self): """We already have the data locally, no need to save it""" pass def load_status(self): """The load was eventful if the current occurrences are different to the ones we retrieved at the beginning of the run""" eventful = False if self.previous_snapshot: eventful = self.snapshot['id'] != self.previous_snapshot['id'] else: eventful = bool(self.snapshot['branches']) return {'status': ('eventful' if eventful else 'uneventful')} -class GitLoaderFromArchive(GitLoader): +class GitLoaderFromArchive(GitLoaderFromDisk): """Load a git repository from an archive. + This loader ingests a git repository compressed into an archive. + The supported archive formats are ``.zip`` and ``.tar.gz``. + + From an input tarball named ``my-git-repo.zip``, the following layout is + expected in it:: + + my-git-repo/ + ├── .git + │ ├── branches + │ ├── COMMIT_EDITMSG + │ ├── config + │ ├── description + │ ├── HEAD + ... + + Nevertheless, the loader is able to ingest tarballs with the following + layouts too:: + + . + ├── .git + │ ├── branches + │ ├── COMMIT_EDITMSG + │ ├── config + │ ├── description + │ ├── HEAD + ... + + or:: + + other-repo-name/ + ├── .git + │ ├── branches + │ ├── COMMIT_EDITMSG + │ ├── config + │ ├── description + │ ├── HEAD + ... + """ + def __init__(self, *args, **kwargs): + super().__init__(*args, **kwargs) + self.temp_dir = self.repo_path = None + def project_name_from_archive(self, archive_path): """Compute the project name from the archive's path. """ - return os.path.basename(os.path.dirname(archive_path)) + archive_name = os.path.basename(archive_path) + for ext in ('.zip', '.tar.gz', '.tgz'): + if archive_name.lower().endswith(ext): + archive_name = archive_name[:-len(ext)] + break + return archive_name + + def prepare_origin_visit(self, origin_url, archive_path, visit_date): + self._prepare_origin_visit(origin_url, visit_date) def prepare(self, origin_url, archive_path, visit_date): """1. Uncompress the archive in temporary location. - 2. Prepare as the GitLoader does - 3. Load as GitLoader does + 2. Prepare as the GitLoaderFromDisk does + 3. Load as GitLoaderFromDisk does """ project_name = self.project_name_from_archive(archive_path) self.temp_dir, self.repo_path = utils.init_git_repo_from_archive( project_name, archive_path) self.log.info('Project %s - Uncompressing archive %s at %s' % ( origin_url, os.path.basename(archive_path), self.repo_path)) super().prepare(origin_url, self.repo_path, visit_date) def cleanup(self): """Cleanup the temporary location (if it exists). 
""" if self.temp_dir and os.path.exists(self.temp_dir): shutil.rmtree(self.temp_dir) self.log.info('Project %s - Done injecting %s' % ( self.origin_url, self.repo_path)) if __name__ == '__main__': import click import logging logging.basicConfig( level=logging.DEBUG, format='%(asctime)s %(process)d %(message)s' ) @click.command() @click.option('--origin-url', help='origin url') @click.option('--git-directory', help='Path to git repository to load') @click.option('--visit-date', default=None, help='Visit date') def main(origin_url, git_directory, visit_date): if not visit_date: visit_date = datetime.datetime.now(tz=datetime.timezone.utc) - return GitLoader().load(origin_url, git_directory, visit_date) + return GitLoaderFromDisk().load(origin_url, git_directory, visit_date) main() diff --git a/swh/loader/git/loader.py b/swh/loader/git/loader.py index 51dd375..019fb97 100644 --- a/swh/loader/git/loader.py +++ b/swh/loader/git/loader.py @@ -1,307 +1,515 @@ -# Copyright (C) 2015-2018 The Software Heritage developers +# Copyright (C) 2016-2018 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import datetime -import dulwich.repo +import dulwich.client +import logging import os -import shutil +import pickle +import sys -from dulwich.errors import ObjectFormatException, EmptyFileException from collections import defaultdict +from io import BytesIO +from dulwich.object_store import ObjectStoreGraphWalker +from dulwich.pack import PackData, PackInflater from swh.model import hashutil -from swh.loader.core.loader import SWHStatelessLoader -from . import converters, utils +from swh.loader.core.loader import UnbufferedLoader +from swh.storage.algos.snapshot import snapshot_get_all_branches +from . import converters -class GitLoader(SWHStatelessLoader): - """Load a git repository from a directory. 
- """ +class RepoRepresentation: + """Repository representation for a Software Heritage origin.""" + def __init__(self, storage, origin_id, base_snapshot=None, + ignore_history=False): + self.storage = storage - CONFIG_BASE_FILENAME = 'loader/git-loader' + self._parents_cache = {} + self._type_cache = {} - def __init__(self, config=None): - super().__init__(logging_class='swh.loader.git.Loader', config=config) + self.ignore_history = ignore_history - def prepare_origin_visit(self, origin_url, directory, visit_date): - self.origin_url = origin_url - self.origin = converters.origin_url_to_origin(self.origin_url) - self.visit_date = visit_date + if origin_id and not ignore_history: + self.heads = set(self._cache_heads(origin_id, base_snapshot)) + else: + self.heads = set() + + def _fill_parents_cache(self, commits): + """When querying for a commit's parents, we fill the cache to a depth of 1000 + commits.""" + root_revs = self._encode_for_storage(commits) + for rev, parents in self.storage.revision_shortlog(root_revs, 1000): + rev_id = hashutil.hash_to_bytehex(rev) + if rev_id not in self._parents_cache: + self._parents_cache[rev_id] = [ + hashutil.hash_to_bytehex(parent) for parent in parents + ] + for rev in commits: + if rev not in self._parents_cache: + self._parents_cache[rev] = [] + + def _cache_heads(self, origin_id, base_snapshot): + """Return all the known head commits for `origin_id`""" + _git_types = ['content', 'directory', 'revision', 'release'] + + if not base_snapshot: + return [] + + snapshot_targets = set() + for target in base_snapshot['branches'].values(): + if target and target['target_type'] in _git_types: + snapshot_targets.add(target['target']) + + decoded_targets = self._decode_from_storage(snapshot_targets) + + for id, objs in self.get_stored_objects(decoded_targets).items(): + if not objs: + logging.warn('Missing head: %s' % hashutil.hash_to_hex(id)) + return [] + + return decoded_targets + + def get_parents(self, commit): + """Bogus method to prevent expensive recursion, at the expense of less + efficient downloading""" + return [] + + def get_heads(self): + return self.heads + + @staticmethod + def _encode_for_storage(objects): + return [hashutil.bytehex_to_hash(object) for object in objects] + + @staticmethod + def _decode_from_storage(objects): + return set(hashutil.hash_to_bytehex(object) for object in objects) + + def graph_walker(self): + return ObjectStoreGraphWalker(self.get_heads(), self.get_parents) + + @staticmethod + def filter_unwanted_refs(refs): + """Filter the unwanted references from refs""" + ret = {} + for ref, val in refs.items(): + if ref.endswith(b'^{}'): + # Peeled refs make the git protocol explode + continue + elif ref.startswith(b'refs/pull/') and ref.endswith(b'/merge'): + # We filter-out auto-merged GitHub pull requests + continue + else: + ret[ref] = val - def prepare(self, origin_url, directory, visit_date): - self.repo = dulwich.repo.Repo(directory) + return ret - def iter_objects(self): - object_store = self.repo.object_store + def determine_wants(self, refs): + """Filter the remote references to figure out which ones + Software Heritage needs. 
+ """ + if not refs: + return [] - for pack in object_store.packs: - objs = list(pack.index.iterentries()) - objs.sort(key=lambda x: x[1]) - for sha, offset, crc32 in objs: - yield hashutil.hash_to_bytehex(sha) + # Find what objects Software Heritage has + refs = self.find_remote_ref_types_in_swh(refs) - yield from object_store._iter_loose_objects() - yield from object_store._iter_alternate_objects() + # Cache the objects found in swh as existing heads + for target in refs.values(): + if target['target_type'] is not None: + self.heads.add(target['target']) - def _check(self, obj): - """Check the object's repository representation. + ret = set() + for target in self.filter_unwanted_refs(refs).values(): + if target['target_type'] is None: + # The target doesn't exist in Software Heritage, let's retrieve + # it. + ret.add(target['target']) - If any errors in check exists, an ObjectFormatException is - raised. + return list(ret) - Args: - obj (object): Dulwich object read from the repository. + def get_stored_objects(self, objects): + """Find which of these objects were stored in the archive. + Do the request in packets to avoid a server timeout. + """ + if self.ignore_history: + return {} + + packet_size = 1000 + + ret = {} + query = [] + for object in objects: + query.append(object) + if len(query) >= packet_size: + ret.update( + self.storage.object_find_by_sha1_git( + self._encode_for_storage(query) + ) + ) + query = [] + if query: + ret.update( + self.storage.object_find_by_sha1_git( + self._encode_for_storage(query) + ) + ) + return ret + + def find_remote_ref_types_in_swh(self, remote_refs): + """Parse the remote refs information and list the objects that exist in + Software Heritage. """ - obj.check() - from dulwich.objects import Commit, Tag - try: - # For additional checks on dulwich objects with date - # for now, only checks on *time - if isinstance(obj, Commit): - commit_time = obj._commit_time - utils.check_date_time(commit_time) - author_time = obj._author_time - utils.check_date_time(author_time) - elif isinstance(obj, Tag): - tag_time = obj._tag_time - utils.check_date_time(tag_time) - except Exception as e: - raise ObjectFormatException(e) - - def get_object(self, oid): - """Given an object id, return the object if it is found and not - malformed in some way. - Args: - oid (bytes): the object's identifier + all_objs = set(remote_refs.values()) - set(self._type_cache) + type_by_id = {} + + for id, objs in self.get_stored_objects(all_objs).items(): + id = hashutil.hash_to_bytehex(id) + if objs: + type_by_id[id] = objs[0]['type'] + + self._type_cache.update(type_by_id) + + ret = {} + for ref, id in remote_refs.items(): + ret[ref] = { + 'target': id, + 'target_type': self._type_cache.get(id), + } + return ret - Returns: - The object if found without malformation + +class GitLoader(UnbufferedLoader): + """A bulk loader for a git repository""" + CONFIG_BASE_FILENAME = 'loader/git' + + ADDITIONAL_CONFIG = { + 'pack_size_bytes': ('int', 4 * 1024 * 1024 * 1024), + } + + def __init__(self, repo_representation=RepoRepresentation, config=None): + """Initialize the bulk updater. + + Args: + repo_representation: swh's repository representation + which is in charge of filtering between known and remote + data. 
""" - try: - # some errors are raised when reading the object - obj = self.repo[oid] - # some we need to check ourselves - self._check(obj) - except KeyError: - _id = oid.decode('utf-8') - self.log.warn('object %s not found, skipping' % _id, - extra={ - 'swh_type': 'swh_loader_git_missing_object', - 'swh_object_id': _id, - 'origin_id': self.origin_id, - }) - return None - except ObjectFormatException: - _id = oid.decode('utf-8') - self.log.warn('object %s malformed, skipping' % _id, - extra={ - 'swh_type': 'swh_loader_git_missing_object', - 'swh_object_id': _id, - 'origin_id': self.origin_id, - }) - return None - except EmptyFileException: - _id = oid.decode('utf-8') - self.log.warn('object %s corrupted (empty file), skipping' % _id, - extra={ - 'swh_type': 'swh_loader_git_missing_object', - 'swh_object_id': _id, - 'origin_id': self.origin_id, - }) + super().__init__(logging_class='swh.loader.git.BulkLoader', + config=config) + self.repo_representation = repo_representation + + def fetch_pack_from_origin(self, origin_url, base_origin_id, + base_snapshot, do_activity): + """Fetch a pack from the origin""" + pack_buffer = BytesIO() + + base_repo = self.repo_representation( + storage=self.storage, + origin_id=base_origin_id, + base_snapshot=base_snapshot, + ignore_history=self.ignore_history, + ) + + client, path = dulwich.client.get_transport_and_path(origin_url, + thin_packs=False) + + size_limit = self.config['pack_size_bytes'] + + def do_pack(data, + pack_buffer=pack_buffer, + limit=size_limit, + origin_url=origin_url): + cur_size = pack_buffer.tell() + would_write = len(data) + if cur_size + would_write > limit: + raise IOError('Pack file too big for repository %s, ' + 'limit is %d bytes, current size is %d, ' + 'would write %d' % + (origin_url, limit, cur_size, would_write)) + + pack_buffer.write(data) + + remote_refs = client.fetch_pack(path, + base_repo.determine_wants, + base_repo.graph_walker(), + do_pack, + progress=do_activity).refs + + if remote_refs: + local_refs = base_repo.find_remote_ref_types_in_swh(remote_refs) else: - return obj + local_refs = remote_refs = {} + + pack_buffer.flush() + pack_size = pack_buffer.tell() + pack_buffer.seek(0) + + return { + 'remote_refs': base_repo.filter_unwanted_refs(remote_refs), + 'local_refs': local_refs, + 'pack_buffer': pack_buffer, + 'pack_size': pack_size, + } + + def list_pack(self, pack_data, pack_size): + id_to_type = {} + type_to_ids = defaultdict(set) + + inflater = self.get_inflater() + + for obj in inflater: + type, id = obj.type_name, obj.id + id_to_type[id] = type + type_to_ids[type].add(id) + + return id_to_type, type_to_ids + + def prepare_origin_visit(self, origin_url, **kwargs): + self.visit_date = datetime.datetime.now(tz=datetime.timezone.utc) + self.origin = converters.origin_url_to_origin(origin_url) + + def get_full_snapshot(self, origin_id): + prev_snapshot = self.storage.snapshot_get_latest(origin_id) + if prev_snapshot and prev_snapshot.pop('next_branch', None): + return snapshot_get_all_branches(self.storage, prev_snapshot['id']) + + return prev_snapshot + + def prepare(self, origin_url, base_url=None, ignore_history=False): + base_origin_id = origin_id = self.origin_id + + prev_snapshot = None + + if not ignore_history: + prev_snapshot = self.get_full_snapshot(origin_id) + + if base_url and not prev_snapshot: + base_origin = converters.origin_url_to_origin(base_url) + base_origin = self.storage.origin_get(base_origin) + if base_origin: + base_origin_id = base_origin['id'] + prev_snapshot = 
self.get_full_snapshot(base_origin_id) + + self.base_snapshot = prev_snapshot + self.base_origin_id = base_origin_id + self.ignore_history = ignore_history def fetch_data(self): - """Fetch the data from the data source""" - self.previous_snapshot = self.storage.snapshot_get_latest( - self.origin_id - ) + def do_progress(msg): + sys.stderr.buffer.write(msg) + sys.stderr.flush() - type_to_ids = defaultdict(list) - for oid in self.iter_objects(): - obj = self.get_object(oid) - if not obj: - continue - type_name = obj.type_name - type_to_ids[type_name].append(oid) + fetch_info = self.fetch_pack_from_origin( + self.origin['url'], self.base_origin_id, self.base_snapshot, + do_progress) + + self.pack_buffer = fetch_info['pack_buffer'] + self.pack_size = fetch_info['pack_size'] + + self.remote_refs = fetch_info['remote_refs'] + self.local_refs = fetch_info['local_refs'] + origin_url = self.origin['url'] + + self.log.info('Listed %d refs for repo %s' % ( + len(self.remote_refs), origin_url), extra={ + 'swh_type': 'git_repo_list_refs', + 'swh_repo': origin_url, + 'swh_num_refs': len(self.remote_refs), + }) + + # We want to load the repository, walk all the objects + id_to_type, type_to_ids = self.list_pack(self.pack_buffer, + self.pack_size) + + self.id_to_type = id_to_type self.type_to_ids = type_to_ids + def save_data(self): + """Store a pack for archival""" + + write_size = 8192 + pack_dir = self.get_save_data_path() + + pack_name = "%s.pack" % self.visit_date.isoformat() + refs_name = "%s.refs" % self.visit_date.isoformat() + + with open(os.path.join(pack_dir, pack_name), 'xb') as f: + self.pack_buffer.seek(0) + while True: + r = self.pack_buffer.read(write_size) + if not r: + break + f.write(r) + + self.pack_buffer.seek(0) + + with open(os.path.join(pack_dir, refs_name), 'xb') as f: + pickle.dump(self.remote_refs, f) + + def get_inflater(self): + """Reset the pack buffer and get an object inflater from it""" + self.pack_buffer.seek(0) + return PackInflater.for_pack_data( + PackData.from_file(self.pack_buffer, self.pack_size)) + def has_contents(self): - """Checks whether we need to load contents""" return bool(self.type_to_ids[b'blob']) def get_content_ids(self): """Get the content identifiers from the git repository""" - for oid in self.type_to_ids[b'blob']: - yield converters.dulwich_blob_to_content_id(self.repo[oid]) + for raw_obj in self.get_inflater(): + if raw_obj.type_name != b'blob': + continue + + yield converters.dulwich_blob_to_content_id(raw_obj) def get_contents(self): - """Get the contents that need to be loaded""" + """Format the blobs from the git repository as swh contents""" max_content_size = self.config['content_size_limit'] missing_contents = set(self.storage.content_missing( self.get_content_ids(), 'sha1_git')) - for oid in missing_contents: + for raw_obj in self.get_inflater(): + if raw_obj.type_name != b'blob': + continue + + if raw_obj.sha().digest() not in missing_contents: + continue + yield converters.dulwich_blob_to_content( - self.repo[hashutil.hash_to_bytehex(oid)], log=self.log, - max_content_size=max_content_size, + raw_obj, log=self.log, max_content_size=max_content_size, origin_id=self.origin_id) def has_directories(self): - """Checks whether we need to load directories""" return bool(self.type_to_ids[b'tree']) def get_directory_ids(self): """Get the directory identifiers from the git repository""" return (hashutil.hash_to_bytes(id.decode()) for id in self.type_to_ids[b'tree']) def get_directories(self): - """Get the directories that need to be loaded""" + 
"""Format the trees as swh directories""" missing_dirs = set(self.storage.directory_missing( sorted(self.get_directory_ids()))) - for oid in missing_dirs: - yield converters.dulwich_tree_to_directory( - self.repo[hashutil.hash_to_bytehex(oid)], log=self.log) + for raw_obj in self.get_inflater(): + if raw_obj.type_name != b'tree': + continue + + if raw_obj.sha().digest() not in missing_dirs: + continue + + yield converters.dulwich_tree_to_directory(raw_obj, log=self.log) def has_revisions(self): - """Checks whether we need to load revisions""" return bool(self.type_to_ids[b'commit']) def get_revision_ids(self): """Get the revision identifiers from the git repository""" return (hashutil.hash_to_bytes(id.decode()) for id in self.type_to_ids[b'commit']) def get_revisions(self): - """Get the revisions that need to be loaded""" + """Format commits as swh revisions""" missing_revs = set(self.storage.revision_missing( sorted(self.get_revision_ids()))) - for oid in missing_revs: - yield converters.dulwich_commit_to_revision( - self.repo[hashutil.hash_to_bytehex(oid)], log=self.log) + for raw_obj in self.get_inflater(): + if raw_obj.type_name != b'commit': + continue + + if raw_obj.sha().digest() not in missing_revs: + continue + + yield converters.dulwich_commit_to_revision(raw_obj, log=self.log) def has_releases(self): - """Checks whether we need to load releases""" return bool(self.type_to_ids[b'tag']) def get_release_ids(self): """Get the release identifiers from the git repository""" return (hashutil.hash_to_bytes(id.decode()) for id in self.type_to_ids[b'tag']) def get_releases(self): - """Get the releases that need to be loaded""" + """Retrieve all the release objects from the git repository""" missing_rels = set(self.storage.release_missing( sorted(self.get_release_ids()))) - for oid in missing_rels: - yield converters.dulwich_tag_to_release( - self.repo[hashutil.hash_to_bytehex(oid)], log=self.log) + for raw_obj in self.get_inflater(): + if raw_obj.type_name != b'tag': + continue + + if raw_obj.sha().digest() not in missing_rels: + continue + + yield converters.dulwich_tag_to_release(raw_obj, log=self.log) def get_snapshot(self): - """Turn the list of branches into a snapshot to load""" branches = {} - for ref, target in self.repo.refs.as_dict().items(): - obj = self.get_object(target) - if obj: - branches[ref] = { - 'target': hashutil.bytehex_to_hash(target), - 'target_type': converters.DULWICH_TYPES[obj.type_name], - } - else: - branches[ref] = None + for ref in self.remote_refs: + ret_ref = self.local_refs[ref].copy() + if not ret_ref['target_type']: + target_type = self.id_to_type[ret_ref['target']] + ret_ref['target_type'] = converters.DULWICH_TYPES[target_type] + + ret_ref['target'] = hashutil.bytehex_to_hash(ret_ref['target']) + + branches[ref] = ret_ref self.snapshot = converters.branches_to_snapshot(branches) return self.snapshot def get_fetch_history_result(self): - """Return the data to store in fetch_history for the current loader""" return { 'contents': len(self.type_to_ids[b'blob']), 'directories': len(self.type_to_ids[b'tree']), 'revisions': len(self.type_to_ids[b'commit']), 'releases': len(self.type_to_ids[b'tag']), } - def save_data(self): - """We already have the data locally, no need to save it""" - pass - def load_status(self): - """The load was eventful if the current occurrences are different to - the ones we retrieved at the beginning of the run""" + """The load was eventful if the current snapshot is different to + the one we retrieved at the beginning of the run""" 
eventful = False - if self.previous_snapshot: - eventful = self.snapshot['id'] != self.previous_snapshot['id'] + if self.base_snapshot: + eventful = self.snapshot['id'] != self.base_snapshot['id'] else: eventful = bool(self.snapshot['branches']) return {'status': ('eventful' if eventful else 'uneventful')} -class GitLoaderFromArchive(GitLoader): - """Load a git repository from an archive. - - """ - def project_name_from_archive(self, archive_path): - """Compute the project name from the archive's path. - - """ - return os.path.basename(os.path.dirname(archive_path)) - - def prepare(self, origin_url, archive_path, visit_date): - """1. Uncompress the archive in temporary location. - 2. Prepare as the GitLoader does - 3. Load as GitLoader does - - """ - project_name = self.project_name_from_archive(archive_path) - self.temp_dir, self.repo_path = utils.init_git_repo_from_archive( - project_name, archive_path) - - self.log.info('Project %s - Uncompressing archive %s at %s' % ( - origin_url, os.path.basename(archive_path), self.repo_path)) - super().prepare(origin_url, self.repo_path, visit_date) - - def cleanup(self): - """Cleanup the temporary location (if it exists). - - """ - if self.temp_dir and os.path.exists(self.temp_dir): - shutil.rmtree(self.temp_dir) - self.log.info('Project %s - Done injecting %s' % ( - self.origin_url, self.repo_path)) - - if __name__ == '__main__': import click - import logging logging.basicConfig( level=logging.DEBUG, format='%(asctime)s %(process)d %(message)s' ) @click.command() - @click.option('--origin-url', help='origin url') - @click.option('--git-directory', help='Path to git repository to load') - @click.option('--visit-date', default=None, help='Visit date') - def main(origin_url, git_directory, visit_date): - if not visit_date: - visit_date = datetime.datetime.now(tz=datetime.timezone.utc) - - return GitLoader().load(origin_url, git_directory, visit_date) + @click.option('--origin-url', help='Origin url', required=True) + @click.option('--base-url', default=None, help='Optional Base url') + @click.option('--ignore-history/--no-ignore-history', + help='Ignore the repository history', default=False) + def main(origin_url, base_url, ignore_history): + return GitLoader().load( + origin_url, + base_url=base_url, + ignore_history=ignore_history, + ) main() diff --git a/swh/loader/git/reader.py b/swh/loader/git/reader.py deleted file mode 100644 index 2da2b7f..0000000 --- a/swh/loader/git/reader.py +++ /dev/null @@ -1,258 +0,0 @@ -# Copyright (C) 2016-2018 The Software Heritage developers -# See the AUTHORS file at the top-level directory of this distribution -# License: GNU General Public License version 3, or any later version -# See top-level LICENSE file for more information - -from collections import defaultdict -import logging -import pprint - -import click - -from swh.core import utils -from swh.model.hashutil import MultiHash, hash_to_hex - -from .updater import BulkUpdater, SWHRepoRepresentation -from . import converters - - -class SWHRepoFullRepresentation(SWHRepoRepresentation): - """Overridden representation of a swh repository to permit to read - completely the remote repository. - - """ - def __init__(self, storage, origin_id, occurrences=None): - self.storage = storage - self._parents_cache = {} - self._type_cache = {} - self.heads = set() - - def determine_wants(self, refs): - """Filter the remote references to figure out which ones Software - Heritage needs. In this particular context, we want to know - everything. 
- - """ - if not refs: - return [] - - for target in refs.values(): - self.heads.add(target) - - return self.filter_unwanted_refs(refs).values() - - def find_remote_ref_types_in_swh(self, remote_refs): - """Find the known swh remote. - In that particular context, we know nothing. - - """ - return {} - - -class DummyGraphWalker(object): - """Dummy graph walker which claims that the client doesn’t have any - objects. - - """ - def ack(self, sha): pass - - def next(self): pass - - def __next__(self): pass - - -class BaseGitRemoteReader(BulkUpdater): - CONFIG_BASE_FILENAME = 'loader/git-remote-reader' - - ADDITIONAL_CONFIG = { - 'pack_size_bytes': ('int', 4 * 1024 * 1024 * 1024), - 'pack_storage_base': ('str', ''), # don't want to store packs so empty - 'next_task': ( - 'dict', { - 'queue': 'swh.storage.archiver.tasks.SWHArchiverToBackendTask', - 'batch_size': 100, - 'destination': 'azure' - } - ) - } - - def __init__(self): - super().__init__(SWHRepoFullRepresentation) - self.next_task = self.config['next_task'] - self.batch_size = self.next_task['batch_size'] - self.task_destination = self.next_task['queue'] - self.destination = self.next_task['destination'] - - def graph_walker(self): - return DummyGraphWalker() - - def prepare_origin_visit(self, origin_url, base_url=None): - self.origin = converters.origin_url_to_origin(origin_url) - self.origin_id = 0 - - def prepare(self, origin_url, base_url=None): - """Only retrieve information about the origin, set everything else to - empty. - - """ - self.base_occurrences = [] - self.base_origin_id = 0 - - def keep_object(self, obj): - """Do we want to keep this object or not?""" - raise NotImplementedError('Please implement keep_object') - - def get_id_and_data(self, obj): - """Get the id, type and data of the given object""" - raise NotImplementedError('Please implement get_id_and_data') - - def list_pack(self, pack_data, pack_size): - """Override list_pack to only keep contents' sha1. - - Returns: - id_to_type (dict): keys are sha1, values are their associated type - type_to_ids (dict): keys are types, values are list of associated - data (sha1 for blobs) - - """ - self.data = {} - id_to_type = {} - type_to_ids = defaultdict(set) - - inflater = self.get_inflater() - - for obj in inflater: - if not self.keep_object(obj): - continue - - object_id, type, data = self.get_id_and_data(obj) - - id_to_type[object_id] = type - type_to_ids[type].add(object_id) - self.data[object_id] = data - - return id_to_type, type_to_ids - - def load(self, *args, **kwargs): - """Override the loading part which simply reads the repository's - contents' sha1. - - Returns: - Returns the list of discovered sha1s for that origin. - - """ - self.prepare(*args, **kwargs) - self.fetch_data() - - -class GitSha1RemoteReader(BaseGitRemoteReader): - """Read sha1 git from a remote repository and dump only repository's - content sha1 as list. - - """ - def keep_object(self, obj): - """Only keep blobs""" - return obj.type_name == b'blob' - - def get_id_and_data(self, obj): - """We want to store only object identifiers""" - # compute the sha1 (obj.id is the sha1_git) - data = obj.as_raw_string() - hashes = MultiHash.from_data(data, {'sha1'}).digest() - oid = hashes['sha1'] - return (oid, b'blob', oid) - - -class GitSha1RemoteReaderAndSendToQueue(GitSha1RemoteReader): - """Read sha1 git from a remote repository and dump only repository's - content sha1 as list and send batch of those sha1s to a celery - queue for consumption. 
- - """ - def load(self, *args, **kwargs): - """Retrieve the list of sha1s for a particular origin and send those - sha1s as group of sha1s to a specific queue. - - """ - super().load(*args, **kwargs) - - data = self.type_to_ids[b'blob'] - - from swh.scheduler.celery_backend.config import app - try: - # optional dependency - from swh.storage.archiver import tasks # noqa - except ImportError: - pass - from celery import group - - task_destination = app.tasks[self.task_destination] - groups = [] - for ids in utils.grouper(data, self.batch_size): - sig_ids = task_destination.s(destination=self.destination, - batch=list(ids)) - groups.append(sig_ids) - group(groups).delay() - return data - - -class GitCommitRemoteReader(BaseGitRemoteReader): - def keep_object(self, obj): - return obj.type_name == b'commit' - - def get_id_and_data(self, obj): - return obj.id, b'commit', converters.dulwich_commit_to_revision(obj) - - def load(self, *args, **kwargs): - super().load(*args, **kwargs) - return self.data - - -@click.group() -@click.option('--origin-url', help='Origin url') -@click.pass_context -def main(ctx, origin_url): - logging.basicConfig( - level=logging.DEBUG, - format='%(asctime)s %(process)d %(message)s' - ) - ctx.obj['origin_url'] = origin_url - - -@main.command() -@click.option('--send/--nosend', default=False, help='Origin\'s url') -@click.pass_context -def blobs(ctx, send): - origin_url = ctx.obj['origin_url'] - - if send: - loader = GitSha1RemoteReaderAndSendToQueue() - ids = loader.load(origin_url) - print('%s sha1s were sent to queue' % len(ids)) - return - - loader = GitSha1RemoteReader() - ids = loader.load(origin_url) - - if ids: - for oid in ids: - print(hash_to_hex(oid)) - - -@main.command() -@click.option('--ids-only', is_flag=True, help='print ids only') -@click.pass_context -def commits(ctx, ids_only): - origin_url = ctx.obj['origin_url'] - - reader = GitCommitRemoteReader() - commits = reader.load(origin_url) - for commit_id, commit in commits.items(): - if ids_only: - print(commit_id.decode()) - else: - pprint.pprint(commit) - - -if __name__ == '__main__': - main(obj={}) diff --git a/swh/loader/git/tasks.py b/swh/loader/git/tasks.py index 5eefff1..1ad2e06 100644 --- a/swh/loader/git/tasks.py +++ b/swh/loader/git/tasks.py @@ -1,70 +1,43 @@ -# Copyright (C) 2015-2017 The Software Heritage developers +# Copyright (C) 2015-2018 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import dateutil.parser -from swh.scheduler.task import Task +from celery import current_app as app -from .loader import GitLoader, GitLoaderFromArchive -from .updater import BulkUpdater -from .reader import GitSha1RemoteReaderAndSendToQueue +from swh.loader.git.from_disk import GitLoaderFromDisk, GitLoaderFromArchive +from swh.loader.git.loader import GitLoader -# TODO: rename to LoadRemoteGitRepository -class UpdateGitRepository(Task): +@app.task(name=__name__ + '.UpdateGitRepository') +def update_git_repository(repo_url, base_url=None): """Import a git repository from a remote location""" - task_queue = 'swh_loader_git' + loader = GitLoader() + return loader.load(repo_url, base_url=base_url) - def run_task(self, repo_url, base_url=None): - """Import a git repository""" - loader = BulkUpdater() - loader.log = self.log - return loader.load(repo_url, base_url=base_url) +@app.task(name=__name__ + '.LoadDiskGitRepository') +def 
load_disk_git_repository(origin_url, directory, date): + """Import a git repository from disk + Import a git repository, cloned in `directory` from `origin_url` at + `date`. -class LoadDiskGitRepository(Task): - """Import a git repository from disk""" - task_queue = 'swh_loader_git_express' + """ + loader = GitLoaderFromDisk() + return loader.load(origin_url, directory, dateutil.parser.parse(date)) - def run_task(self, origin_url, directory, date): - """Import a git repository, cloned in `directory` from `origin_url` at - `date`.""" - loader = GitLoader() - loader.log = self.log +@app.task(name=__name__ + '.UncompressAndLoadDiskGitRepository') +def run_task(origin_url, archive_path, date): + """Import a git repository from a zip archive - return loader.load(origin_url, directory, dateutil.parser.parse(date)) - - -class UncompressAndLoadDiskGitRepository(Task): - """Import a git repository from a zip archive""" - task_queue = 'swh_loader_git_archive' - - def run_task(self, origin_url, archive_path, date): - """1. Uncompress an archive repository in a local and temporary folder - 2. Load it through the git disk loader - 3. Clean up the temporary folder - - """ - loader = GitLoaderFromArchive() - loader.log = self.log - - return loader.load( - origin_url, archive_path, dateutil.parser.parse(date)) - - -class ReaderGitRepository(Task): - task_queue = 'swh_reader_git' - - def run_task(self, repo_url, base_url=None): - """Read a git repository from a remote location and send sha1 to - archival. - - """ - loader = GitSha1RemoteReaderAndSendToQueue() - loader.log = self.log - - return loader.load(repo_url) + 1. Uncompress an archive repository in a local and temporary folder + 2. Load it through the git disk loader + 3. Clean up the temporary folder + """ + loader = GitLoaderFromArchive() + return loader.load( + origin_url, archive_path, dateutil.parser.parse(date)) diff --git a/swh/loader/git/tests/__init__.py b/swh/loader/git/tests/__init__.py index e69de29..a07e188 100644 --- a/swh/loader/git/tests/__init__.py +++ b/swh/loader/git/tests/__init__.py @@ -0,0 +1,21 @@ +TEST_LOADER_CONFIG = { + 'storage': { + 'cls': 'memory', + 'args': { + } + }, + 'send_contents': True, + 'send_directories': True, + 'send_revisions': True, + 'send_releases': True, + 'send_snapshot': True, + + 'content_size_limit': 100 * 1024 * 1024, + 'content_packet_size': 10, + 'content_packet_size_bytes': 100 * 1024 * 1024, + 'directory_packet_size': 10, + 'revision_packet_size': 10, + 'release_packet_size': 10, + + 'save_data': False, +} diff --git a/swh/loader/git/tests/conftest.py b/swh/loader/git/tests/conftest.py new file mode 100644 index 0000000..55e1b3e --- /dev/null +++ b/swh/loader/git/tests/conftest.py @@ -0,0 +1,10 @@ +import pytest + +from swh.scheduler.tests.conftest import * # noqa + + +@pytest.fixture(scope='session') +def celery_includes(): + return [ + 'swh.loader.git.tasks', + ] diff --git a/swh/loader/git/tests/test_converters.py b/swh/loader/git/tests/test_converters.py index 490b449..fd6753b 100644 --- a/swh/loader/git/tests/test_converters.py +++ b/swh/loader/git/tests/test_converters.py @@ -1,307 +1,332 @@ -# Copyright (C) 2015-2017 The Software Heritage developers +# Copyright (C) 2015-2018 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import os +import pytest import shutil import subprocess import tempfile import unittest 
import dulwich.repo -from nose.plugins.attrib import attr import swh.loader.git.converters as converters from swh.model.hashutil import bytehex_to_hash, hash_to_bytes TEST_DATA = os.path.join(os.path.dirname(__file__), 'data') class SWHTargetType: """Dulwich lookalike TargetType class """ def __init__(self, type_name): self.type_name = type_name class SWHTag: """Dulwich lookalike tag class """ def __init__(self, name, type_name, target, target_type, tagger, tag_time, tag_timezone, message): self.name = name self.type_name = type_name self.object = SWHTargetType(target_type), target self.tagger = tagger self._message = message self.tag_time = tag_time self.tag_timezone = tag_timezone self._tag_timezone_neg_utc = False def sha(self): from hashlib import sha1 return sha1() -@attr('fs') +@pytest.mark.fs class TestConverters(unittest.TestCase): @classmethod def setUpClass(cls): super().setUpClass() cls.repo_path = tempfile.mkdtemp() cls.repo = dulwich.repo.Repo.init_bare(cls.repo_path) fast_export = os.path.join( TEST_DATA, 'git-repos', 'example-submodule.fast-export.xz') xz = subprocess.Popen( ['xzcat'], stdin=open(fast_export, 'rb'), stdout=subprocess.PIPE, ) git = subprocess.Popen( ['git', 'fast-import', '--quiet'], stdin=xz.stdout, cwd=cls.repo_path, ) # flush stdout of xz xz.stdout.close() git.communicate() @classmethod def tearDownClass(cls): super().tearDownClass() shutil.rmtree(cls.repo_path) def setUp(self): super().setUp() self.blob_id = b'28c6f4023d65f74e3b59a2dea3c4277ed9ee07b0' self.blob = { 'sha1_git': bytehex_to_hash(self.blob_id), 'sha1': hash_to_bytes('4850a3420a2262ff061cb296fb915430fa92301c'), 'sha256': hash_to_bytes('fee7c8a485a10321ad94b64135073cb5' '5f22cb9f57fa2417d2adfb09d310adef'), 'blake2s256': hash_to_bytes('5d71873f42a137f6d89286e43677721e574' '1fa05ce4cd5e3c7ea7c44d4c2d10b'), 'data': (b'[submodule "example-dependency"]\n' b'\tpath = example-dependency\n' b'\turl = https://github.com/githubtraining/' b'example-dependency.git\n'), 'length': 124, 'status': 'visible', } self.blob_hidden = { 'sha1_git': bytehex_to_hash(self.blob_id), 'sha1': hash_to_bytes('4850a3420a2262ff061cb296fb915430fa92301c'), 'sha256': hash_to_bytes('fee7c8a485a10321ad94b64135073cb5' '5f22cb9f57fa2417d2adfb09d310adef'), 'blake2s256': hash_to_bytes('5d71873f42a137f6d89286e43677721e574' '1fa05ce4cd5e3c7ea7c44d4c2d10b'), 'length': 124, 'status': 'absent', 'reason': 'Content too large', 'origin': None, } def test_blob_to_content(self): content = converters.dulwich_blob_to_content(self.repo[self.blob_id]) self.assertEqual(self.blob, content) def test_blob_to_content_absent(self): max_length = self.blob['length'] - 1 content = converters.dulwich_blob_to_content( self.repo[self.blob_id], max_content_size=max_length) self.assertEqual(self.blob_hidden, content) + def test_convertion_wrong_input(self): + class Something: + type_name = b'something-not-the-right-type' + + m = { + 'blob': converters.dulwich_blob_to_content, + 'blob2': converters.dulwich_blob_to_content_id, + 'tree': converters.dulwich_tree_to_directory, + 'commit': converters.dulwich_tree_to_directory, + 'tag': converters.dulwich_tag_to_release, + } + + for _callable in m.values(): + self.assertIsNone(_callable(Something())) + def test_commit_to_revision(self): sha1 = b'9768d0b576dbaaecd80abedad6dfd0d72f1476da' revision = converters.dulwich_commit_to_revision(self.repo[sha1]) expected_revision = { 'id': hash_to_bytes('9768d0b576dbaaecd80abedad6dfd0d72f1476da'), 'directory': b'\xf0i\\./\xa7\xce\x9dW@#\xc3A7a\xa4s\xe5\x00\xca', 'type': 'git', 
'committer': { 'name': b'Stefano Zacchiroli', 'fullname': b'Stefano Zacchiroli ', 'email': b'zack@upsilon.cc', }, 'author': { 'name': b'Stefano Zacchiroli', 'fullname': b'Stefano Zacchiroli ', 'email': b'zack@upsilon.cc', }, 'committer_date': { 'negative_utc': None, 'timestamp': 1443083765, 'offset': 120, }, 'message': b'add submodule dependency\n', 'metadata': None, 'date': { 'negative_utc': None, 'timestamp': 1443083765, 'offset': 120, }, 'parents': [ b'\xc3\xc5\x88q23`\x9f[\xbb\xb2\xd9\xe7\xf3\xfbJf\x0f?r' ], 'synthetic': False, } - self.assertEquals(revision, expected_revision) + self.assertEqual(revision, expected_revision) def test_author_line_to_author(self): + # edge case out of the way + self.assertIsNone(converters.parse_author(None)) + tests = { b'a ': { 'name': b'a', 'email': b'b@c.com', 'fullname': b'a ', }, b'': { 'name': None, 'email': b'foo@bar.com', 'fullname': b'', }, b'malformed ': { 'name': b'trailing', 'email': b'sp@c.e', 'fullname': b'trailing ', }, b'no': { 'name': b'no', 'email': b'sp@c.e', 'fullname': b'no', }, b' <>': { 'name': b'', 'email': b'', 'fullname': b' <>', }, + b'something': { + 'name': None, + 'email': None, + 'fullname': b'something' + } } for author in sorted(tests): parsed_author = tests[author] - self.assertEquals(parsed_author, - converters.parse_author(author)) + self.assertEqual(parsed_author, + converters.parse_author(author)) def test_dulwich_tag_to_release_no_author_no_date(self): target = b'641fb6e08ddb2e4fd096dcf18e80b894bf' message = b'some release message' tag = SWHTag(name='blah', type_name=b'tag', target=target, target_type=b'commit', message=message, tagger=None, tag_time=None, tag_timezone=None) # when actual_release = converters.dulwich_tag_to_release(tag) # then expected_release = { 'author': None, 'date': None, 'id': b'\xda9\xa3\xee^kK\r2U\xbf\xef\x95`\x18\x90\xaf\xd8\x07\t', 'message': message, 'metadata': None, 'name': 'blah', 'synthetic': False, 'target': hash_to_bytes(target.decode()), 'target_type': 'revision' } - self.assertEquals(actual_release, expected_release) + self.assertEqual(actual_release, expected_release) def test_dulwich_tag_to_release_author_and_date(self): tagger = b'hey dude ' target = b'641fb6e08ddb2e4fd096dcf18e80b894bf' message = b'some release message' import datetime - date = datetime.datetime(2007, 12, 5).timestamp() + date = datetime.datetime( + 2007, 12, 5, tzinfo=datetime.timezone.utc + ).timestamp() tag = SWHTag(name='blah', type_name=b'tag', target=target, target_type=b'commit', message=message, tagger=tagger, tag_time=date, tag_timezone=0) # when actual_release = converters.dulwich_tag_to_release(tag) # then expected_release = { 'author': { 'email': b'hello@mail.org', 'fullname': b'hey dude ', 'name': b'hey dude' }, 'date': { 'negative_utc': False, 'offset': 0, - 'timestamp': 1196809200.0 + 'timestamp': 1196812800.0 }, 'id': b'\xda9\xa3\xee^kK\r2U\xbf\xef\x95`\x18\x90\xaf\xd8\x07\t', 'message': message, 'metadata': None, 'name': 'blah', 'synthetic': False, 'target': hash_to_bytes(target.decode()), 'target_type': 'revision' } - self.assertEquals(actual_release, expected_release) + self.assertEqual(actual_release, expected_release) def test_dulwich_tag_to_release_author_no_date(self): # to reproduce bug T815 (fixed) tagger = b'hey dude ' target = b'641fb6e08ddb2e4fd096dcf18e80b894bf' message = b'some release message' tag = SWHTag(name='blah', type_name=b'tag', target=target, target_type=b'commit', message=message, tagger=tagger, tag_time=None, tag_timezone=None) # when actual_release = 
     def test_dulwich_tag_to_release_no_author_no_date(self):
         target = b'641fb6e08ddb2e4fd096dcf18e80b894bf'
         message = b'some release message'
         tag = SWHTag(name='blah',
                      type_name=b'tag',
                      target=target,
                      target_type=b'commit',
                      message=message,
                      tagger=None,
                      tag_time=None, tag_timezone=None)

         # when
         actual_release = converters.dulwich_tag_to_release(tag)

         # then
         expected_release = {
             'author': None,
             'date': None,
             'id': b'\xda9\xa3\xee^kK\r2U\xbf\xef\x95`\x18\x90\xaf\xd8\x07\t',
             'message': message,
             'metadata': None,
             'name': 'blah',
             'synthetic': False,
             'target': hash_to_bytes(target.decode()),
             'target_type': 'revision'
         }

-        self.assertEquals(actual_release, expected_release)
+        self.assertEqual(actual_release, expected_release)

     def test_dulwich_tag_to_release_author_and_date(self):
         tagger = b'hey dude <hello@mail.org>'
         target = b'641fb6e08ddb2e4fd096dcf18e80b894bf'
         message = b'some release message'

         import datetime
-        date = datetime.datetime(2007, 12, 5).timestamp()
+        date = datetime.datetime(
+            2007, 12, 5, tzinfo=datetime.timezone.utc
+        ).timestamp()

         tag = SWHTag(name='blah',
                      type_name=b'tag',
                      target=target,
                      target_type=b'commit',
                      message=message,
                      tagger=tagger,
                      tag_time=date,
                      tag_timezone=0)

         # when
         actual_release = converters.dulwich_tag_to_release(tag)

         # then
         expected_release = {
             'author': {
                 'email': b'hello@mail.org',
                 'fullname': b'hey dude <hello@mail.org>',
                 'name': b'hey dude'
             },
             'date': {
                 'negative_utc': False,
                 'offset': 0,
-                'timestamp': 1196809200.0
+                'timestamp': 1196812800.0
             },
             'id': b'\xda9\xa3\xee^kK\r2U\xbf\xef\x95`\x18\x90\xaf\xd8\x07\t',
             'message': message,
             'metadata': None,
             'name': 'blah',
             'synthetic': False,
             'target': hash_to_bytes(target.decode()),
             'target_type': 'revision'
         }

-        self.assertEquals(actual_release, expected_release)
+        self.assertEqual(actual_release, expected_release)

     def test_dulwich_tag_to_release_author_no_date(self):
         # to reproduce bug T815 (fixed)
         tagger = b'hey dude <hello@mail.org>'
         target = b'641fb6e08ddb2e4fd096dcf18e80b894bf'
         message = b'some release message'
         tag = SWHTag(name='blah',
                      type_name=b'tag',
                      target=target,
                      target_type=b'commit',
                      message=message,
                      tagger=tagger,
                      tag_time=None, tag_timezone=None)

         # when
         actual_release = converters.dulwich_tag_to_release(tag)

         # then
         expected_release = {
             'author': {
                 'email': b'hello@mail.org',
                 'fullname': b'hey dude <hello@mail.org>',
                 'name': b'hey dude'
             },
             'date': None,
             'id': b'\xda9\xa3\xee^kK\r2U\xbf\xef\x95`\x18\x90\xaf\xd8\x07\t',
             'message': message,
             'metadata': None,
             'name': 'blah',
             'synthetic': False,
             'target': hash_to_bytes(target.decode()),
             'target_type': 'revision'
         }

-        self.assertEquals(actual_release, expected_release)
+        self.assertEqual(actual_release, expected_release)
diff --git a/swh/loader/git/tests/test_from_disk.py b/swh/loader/git/tests/test_from_disk.py
new file mode 100644
index 0000000..27cb713
--- /dev/null
+++ b/swh/loader/git/tests/test_from_disk.py
@@ -0,0 +1,263 @@
+# Copyright (C) 2018 The Software Heritage developers
+# See the AUTHORS file at the top-level directory of this distribution
+# License: GNU General Public License version 3, or any later version
+# See top-level LICENSE file for more information
+
+import os.path
+import subprocess
+
+
+from swh.loader.git.from_disk import GitLoaderFromDisk, GitLoaderFromArchive
+from swh.loader.core.tests import BaseLoaderTest
+
+from . import TEST_LOADER_CONFIG
+
+
+class GitLoaderFromArchive(GitLoaderFromArchive):
+    def project_name_from_archive(self, archive_path):
+        # We don't want the project name to be 'resources'.
+        return 'testrepo'
+
+    def parse_config_file(self, *args, **kwargs):
+        return TEST_LOADER_CONFIG
+
+
+CONTENT1 = {
+    '33ab5639bfd8e7b95eb1d8d0b87781d4ffea4d5d',  # README v1
+    '349c4ff7d21f1ec0eda26f3d9284c293e3425417',  # README v2
+    '799c11e348d39f1704022b8354502e2f81f3c037',  # file1.txt
+    '4bdb40dfd6ec75cb730e678b5d7786e30170c5fb',  # file2.txt
+}
+
+SNAPSHOT_ID = 'bdf3b06d6017e0d9ad6447a73da6ff1ae9efb8f0'
+
+SNAPSHOT1 = {
+    'id': SNAPSHOT_ID,
+    'branches': {
+        'HEAD': {
+            'target': '2f01f5ca7e391a2f08905990277faf81e709a649',
+            'target_type': 'revision',
+        },
+        'refs/heads/master': {
+            'target': '2f01f5ca7e391a2f08905990277faf81e709a649',
+            'target_type': 'revision',
+        },
+        'refs/heads/branch1': {
+            'target': 'b0a77609903f767a2fd3d769904ef9ef68468b87',
+            'target_type': 'revision',
+        },
+        'refs/heads/branch2': {
+            'target': 'bd746cd1913721b269b395a56a97baf6755151c2',
+            'target_type': 'revision',
+        },
+        'refs/tags/branch2-after-delete': {
+            'target': 'bd746cd1913721b269b395a56a97baf6755151c2',
+            'target_type': 'revision',
+        },
+        'refs/tags/branch2-before-delete': {
+            'target': '1135e94ccf73b5f9bd6ef07b3fa2c5cc60bba69b',
+            'target_type': 'revision',
+        },
+    },
+}
+
+# directory hashes obtained with:
+# gco b6f40292c4e94a8f7e7b4aff50e6c7429ab98e2a
+# swh-hashtree --ignore '.git' --path .
+# gco 2f01f5ca7e391a2f08905990277faf81e709a649
+# swh-hashtree --ignore '.git' --path .
+# gco bcdc5ebfde1a3cd6c96e0c2ea4eed19c13208777
+# swh-hashtree --ignore '.git' --path .
+# gco 1135e94ccf73b5f9bd6ef07b3fa2c5cc60bba69b
+# swh-hashtree --ignore '.git' --path .
+# gco 79f65ac75f79dda6ff03d66e1242702ab67fb51c
+# swh-hashtree --ignore '.git' --path .
+# gco b0a77609903f767a2fd3d769904ef9ef68468b87
+# swh-hashtree --ignore '.git' --path .
+# gco bd746cd1913721b269b395a56a97baf6755151c2
+# swh-hashtree --ignore '.git' --path .
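+#
+# REVISIONS1 maps each revision (commit) id in the test repository to the
+# swh id of its root directory, computed with swh-hashtree as shown above.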
+REVISIONS1 = {
+    'b6f40292c4e94a8f7e7b4aff50e6c7429ab98e2a':
+        '40dbdf55dfd4065422462cc74a949254aefa972e',
+    '2f01f5ca7e391a2f08905990277faf81e709a649':
+        'e1d0d894835f91a0f887a4bc8b16f81feefdfbd5',
+    'bcdc5ebfde1a3cd6c96e0c2ea4eed19c13208777':
+        'b43724545b4759244bb54be053c690649161411c',
+    '1135e94ccf73b5f9bd6ef07b3fa2c5cc60bba69b':
+        'fbf70528223d263661b5ad4b80f26caf3860eb8e',
+    '79f65ac75f79dda6ff03d66e1242702ab67fb51c':
+        '5df34ec74d6f69072d9a0a6677d8efbed9b12e60',
+    'b0a77609903f767a2fd3d769904ef9ef68468b87':
+        '9ca0c7d6ffa3f9f0de59fd7912e08f11308a1338',
+    'bd746cd1913721b269b395a56a97baf6755151c2':
+        'e1d0d894835f91a0f887a4bc8b16f81feefdfbd5',
+}
+
+
+class BaseGitLoaderFromDiskTest(BaseLoaderTest):
+    def setUp(self, archive_name, uncompress_archive, filename='testrepo'):
+        super().setUp(archive_name=archive_name, filename=filename,
+                      prefix_tmp_folder_name='swh.loader.git.',
+                      start_path=os.path.dirname(__file__),
+                      uncompress_archive=uncompress_archive)
+
+
+class GitLoaderFromDiskTest(GitLoaderFromDisk):
+    def parse_config_file(self, *args, **kwargs):
+        return TEST_LOADER_CONFIG
+
+
+class BaseDirGitLoaderFromDiskTest(BaseGitLoaderFromDiskTest):
+    """Mixin base loader test to prepare the git
+       repository to uncompress, load and test the results.
+
+       This variant sets up an uncompressed local directory to load from.
+    """
+    def setUp(self):
+        super().setUp('testrepo.tgz', uncompress_archive=True)
+        self.loader = GitLoaderFromDiskTest()
+        self.storage = self.loader.storage
+
+    def load(self):
+        return self.loader.load(
+            origin_url=self.repo_url,
+            visit_date='2016-05-03 15:16:32+00',
+            directory=self.destination_path)
+
+
+class BaseGitLoaderFromArchiveTest(BaseGitLoaderFromDiskTest):
+    """Mixin base loader test to prepare the git
+       repository to uncompress, load and test the results.
+
+       This variant sets up a still-compressed archive to load from.
+    """
+    def setUp(self):
+        super().setUp('testrepo.tgz', uncompress_archive=False)
+        self.loader = GitLoaderFromArchive()
+        self.storage = self.loader.storage
+
+    def load(self):
+        return self.loader.load(
+            origin_url=self.repo_url,
+            visit_date='2016-05-03 15:16:32+00',
+            archive_path=self.destination_path)
+
+
+class GitLoaderFromDiskTests:
+    """Common tests for all git loaders."""
+    def test_load(self):
+        """Loads a simple repository (made available by `setUp()`),
+        and checks everything was added in the storage."""
+        res = self.load()
+        self.assertEqual(res['status'], 'eventful', res)
+
+        self.assertContentsContain(CONTENT1)
+        self.assertCountDirectories(7)
+        self.assertCountReleases(0)  # FIXME: why not 2?
+        self.assertCountRevisions(7)
+        self.assertCountSnapshots(1)
+
+        self.assertRevisionsContain(REVISIONS1)
+
+        self.assertSnapshotEqual(SNAPSHOT1)
+
+        self.assertEqual(self.loader.load_status(), {'status': 'eventful'})
+        self.assertEqual(self.loader.visit_status(), 'full')
+
+    def test_load_unchanged(self):
+        """Checks loading a repository a second time does not add
+        any extra data."""
+        res = self.load()
+        self.assertEqual(res['status'], 'eventful')
+
+        res = self.load()
+        self.assertEqual(res['status'], 'uneventful')
+        self.assertCountSnapshots(1)
+
+
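+# The concrete test cases below combine one of the Base*Test fixture
+# classes above (which provide setUp() and load()) with the
+# GitLoaderFromDiskTests mixin (which provides the assertions).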
+class DirGitLoaderTest(BaseDirGitLoaderFromDiskTest, GitLoaderFromDiskTests):
+    """Tests for the GitLoaderFromDisk. Includes the common ones, and
+       adds others that only work with a local dir."""
+
+    def _git(self, *cmd):
+        """Small wrapper around subprocess to call Git."""
+        try:
+            return subprocess.check_output(
+                ['git', '-C', self.destination_path] + list(cmd))
+        except subprocess.CalledProcessError as e:
+            print(e.output)
+            print(e.stderr)
+            raise
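+
+    # e.g. self._git('rev-parse', 'master') runs
+    # `git -C <destination_path> rev-parse master` and returns its output
+    # as bytes.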
+
+    def test_load_changed(self):
+        """Loads a repository, makes some changes by adding files, commits,
+        and merges, loads it again, and checks the storage contains
+        everything it should."""
+        # Initial load
+        res = self.load()
+        self.assertEqual(res['status'], 'eventful', res)
+
+        self._git('config', '--local', 'user.email', 'you@example.com')
+        self._git('config', '--local', 'user.name', 'Your Name')
+
+        # Load with a new file + revision
+        with open(os.path.join(self.destination_path, 'hello.py'), 'a') as fd:
+            fd.write("print('Hello world')\n")
+
+        self._git('add', 'hello.py')
+        self._git('commit', '-m', 'Hello world')
+        new_revision = self._git('rev-parse', 'master').decode().strip()
+
+        revisions = REVISIONS1.copy()
+        assert new_revision not in revisions
+        revisions[new_revision] = '85dae072a5aa9923ffa7a7568f819ff21bf49858'
+
+        res = self.load()
+        self.assertEqual(res['status'], 'eventful')
+
+        self.assertCountContents(4 + 1)
+        self.assertCountDirectories(7 + 1)
+        self.assertCountReleases(0)  # FIXME: why not 2?
+        self.assertCountRevisions(7 + 1)
+        self.assertCountSnapshots(1 + 1)
+
+        self.assertRevisionsContain(revisions)
+
+        # TODO: how to check the snapshot id?
+        # self.assertSnapshotEqual(SNAPSHOT1)
+
+        self.assertEqual(self.loader.load_status(), {'status': 'eventful'})
+        self.assertEqual(self.loader.visit_status(), 'full')
+
+        # Load with a new merge
+        self._git('merge', 'branch1', '-m', 'merge')
+        new_revision = self._git('rev-parse', 'master').decode().strip()
+
+        assert new_revision not in revisions
+        revisions[new_revision] = 'dab8a37df8db8666d4e277bef9a546f585b5bedd'
+
+        res = self.load()
+        self.assertEqual(res['status'], 'eventful')
+
+        self.assertCountContents(4 + 1)
+        self.assertCountDirectories(7 + 2)
+        self.assertCountReleases(0)  # FIXME: why not 2?
+        self.assertCountRevisions(7 + 2)
+        self.assertCountSnapshots(1 + 1 + 1)
+
+        self.assertRevisionsContain(revisions)
+
+        # TODO: how to check the snapshot id?
+        # self.assertSnapshotEqual(SNAPSHOT1)
+
+        self.assertEqual(self.loader.load_status(), {'status': 'eventful'})
+        self.assertEqual(self.loader.visit_status(), 'full')
+
+
+class GitLoaderFromArchiveTest(BaseGitLoaderFromArchiveTest,
+                               GitLoaderFromDiskTests):
+    """Tests for GitLoaderFromArchive. Imports the common ones
+       from GitLoaderFromDiskTests."""
+    pass
diff --git a/swh/loader/git/tests/test_loader.py b/swh/loader/git/tests/test_loader.py
new file mode 100644
index 0000000..1ef1457
--- /dev/null
+++ b/swh/loader/git/tests/test_loader.py
@@ -0,0 +1,28 @@
+# Copyright (C) 2018 The Software Heritage developers
+# See the AUTHORS file at the top-level directory of this distribution
+# License: GNU General Public License version 3, or any later version
+# See top-level LICENSE file for more information
+
+
+from swh.loader.git.loader import GitLoader
+from swh.loader.git.tests.test_from_disk import DirGitLoaderTest
+
+
+class GitLoaderTest(GitLoader):
+    def parse_config_file(self, *args, **kwargs):
+        return {
+            **super().parse_config_file(*args, **kwargs),
+            'storage': {'cls': 'memory', 'args': {}}
+        }
+
+
+class TestGitLoader(DirGitLoaderTest):
+    """Same tests as for the GitLoaderFromDisk, but running on GitLoader."""
+    def setUp(self):
+        super().setUp()
+        self.loader = GitLoaderTest()
+        self.storage = self.loader.storage
+
+    def load(self):
+        return self.loader.load(
+            origin_url=self.repo_url)
diff --git a/swh/loader/git/tests/test_tasks.py b/swh/loader/git/tests/test_tasks.py
new file mode 100644
index 0000000..ab05a06
--- /dev/null
+++ b/swh/loader/git/tests/test_tasks.py
@@ -0,0 +1,54 @@
+# Copyright (C) 2018 The Software Heritage developers
+# See the AUTHORS file at the top-level directory of this distribution
+# License: GNU General Public License version 3, or any later version
+# See top-level LICENSE file for more information
+
+import datetime
+from unittest.mock import patch
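+
+# Each test below mocks the relevant loader's load() method, sends the
+# corresponding celery task, and then checks both the task result and the
+# arguments the loader was called with.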
+
+
+@patch('swh.loader.git.loader.GitLoader.load')
+def test_git_loader(mock_loader, swh_app, celery_session_worker):
+    mock_loader.return_value = {'status': 'eventful'}
+
+    res = swh_app.send_task(
+        'swh.loader.git.tasks.UpdateGitRepository',
+        ('origin_url',))
+    assert res
+    res.wait()
+    assert res.successful()
+
+    assert res.result == {'status': 'eventful'}
+    mock_loader.assert_called_once_with('origin_url', base_url=None)
+
+
+@patch('swh.loader.git.from_disk.GitLoaderFromDisk.load')
+def test_git_loader_from_disk(mock_loader, swh_app, celery_session_worker):
+    mock_loader.return_value = {'status': 'uneventful'}
+
+    res = swh_app.send_task(
+        'swh.loader.git.tasks.LoadDiskGitRepository',
+        ('origin_url2', '/some/repo', '2018-12-10 00:00'))
+    assert res
+    res.wait()
+    assert res.successful()
+
+    assert res.result == {'status': 'uneventful'}
+    mock_loader.assert_called_once_with(
+        'origin_url2', '/some/repo', datetime.datetime(2018, 12, 10, 0, 0))
+
+
+@patch('swh.loader.git.from_disk.GitLoaderFromArchive.load')
+def test_git_loader_from_archive(mock_loader, swh_app, celery_session_worker):
+    mock_loader.return_value = {'status': 'failed'}
+
+    res = swh_app.send_task(
+        'swh.loader.git.tasks.UncompressAndLoadDiskGitRepository',
+        ('origin_url3', '/some/repo', '2017-01-10 00:00'))
+    assert res
+    res.wait()
+    assert res.successful()
+
+    assert res.result == {'status': 'failed'}
+    mock_loader.assert_called_once_with(
+        'origin_url3', '/some/repo', datetime.datetime(2017, 1, 10, 0, 0))
diff --git a/swh/loader/git/updater.py b/swh/loader/git/updater.py
deleted file mode 100644
index 2f1f3f8..0000000
--- a/swh/loader/git/updater.py
+++ /dev/null
@@ -1,493 +0,0 @@
-# Copyright (C) 2016-2018 The Software Heritage developers
-# See the AUTHORS file at the top-level directory of this distribution
-# License: GNU General Public License version 3, or any later version
-# See top-level LICENSE file for more information
-
-import datetime
-import dulwich.client
-import logging
-import os
-import pickle
-import sys
-
-from collections import defaultdict
-from io import BytesIO
-from dulwich.object_store import ObjectStoreGraphWalker
-from dulwich.pack import PackData, PackInflater
-
-from swh.model import hashutil
-from swh.loader.core.loader import SWHStatelessLoader
-from swh.storage.algos.snapshot import snapshot_get_all_branches
-from . import converters
-
-
-class SWHRepoRepresentation:
-    """Repository representation for a Software Heritage origin."""
-    def __init__(self, storage, origin_id, base_snapshot=None,
-                 ignore_history=False):
-        self.storage = storage
-
-        self._parents_cache = {}
-        self._type_cache = {}
-
-        self.ignore_history = ignore_history
-
-        if origin_id and not ignore_history:
-            self.heads = set(self._cache_heads(origin_id, base_snapshot))
-        else:
-            self.heads = set()
-
-    def _fill_parents_cache(self, commits):
-        """When querying for a commit's parents, we fill the cache to a
-        depth of 1000 commits."""
-        root_revs = self._encode_for_storage(commits)
-        for rev, parents in self.storage.revision_shortlog(root_revs, 1000):
-            rev_id = hashutil.hash_to_bytehex(rev)
-            if rev_id not in self._parents_cache:
-                self._parents_cache[rev_id] = [
-                    hashutil.hash_to_bytehex(parent) for parent in parents
-                ]
-        for rev in commits:
-            if rev not in self._parents_cache:
-                self._parents_cache[rev] = []
-
-    def _cache_heads(self, origin_id, base_snapshot):
-        """Return all the known head commits for `origin_id`"""
-        _git_types = ['content', 'directory', 'revision', 'release']
-
-        if not base_snapshot:
-            return []
-
-        snapshot_targets = set()
-        for target in base_snapshot['branches'].values():
-            if target and target['target_type'] in _git_types:
-                snapshot_targets.add(target['target'])
-
-        decoded_targets = self._decode_from_storage(snapshot_targets)
-
-        for id, objs in self.get_stored_objects(decoded_targets).items():
-            if not objs:
-                logging.warn('Missing head: %s' % hashutil.hash_to_hex(id))
-                return []
-
-        return decoded_targets
-
-    def get_parents(self, commit):
-        """Bogus method to prevent expensive recursion, at the expense of
-        less efficient downloading"""
-        return []
-
-    def get_heads(self):
-        return self.heads
-
-    @staticmethod
-    def _encode_for_storage(objects):
-        return [hashutil.bytehex_to_hash(object) for object in objects]
-
-    @staticmethod
-    def _decode_from_storage(objects):
-        return set(hashutil.hash_to_bytehex(object) for object in objects)
-
-    def graph_walker(self):
-        return ObjectStoreGraphWalker(self.get_heads(), self.get_parents)
-
-    @staticmethod
-    def filter_unwanted_refs(refs):
-        """Filter the unwanted references from refs"""
-        ret = {}
-        for ref, val in refs.items():
-            if ref.endswith(b'^{}'):
-                # Peeled refs make the git protocol explode
-                continue
-            elif ref.startswith(b'refs/pull/') and ref.endswith(b'/merge'):
-                # We filter-out auto-merged GitHub pull requests
-                continue
-            else:
-                ret[ref] = val
-
-        return ret
-
-    def determine_wants(self, refs):
-        """Filter the remote references to figure out which ones
-        Software Heritage needs.
-        """
- """ - if not refs: - return [] - - # Find what objects Software Heritage has - refs = self.find_remote_ref_types_in_swh(refs) - - # Cache the objects found in swh as existing heads - for target in refs.values(): - if target['target_type'] is not None: - self.heads.add(target['target']) - - ret = set() - for target in self.filter_unwanted_refs(refs).values(): - if target['target_type'] is None: - # The target doesn't exist in Software Heritage, let's retrieve - # it. - ret.add(target['target']) - - return list(ret) - - def get_stored_objects(self, objects): - if self.ignore_history: - return {} - - return self.storage.object_find_by_sha1_git( - self._encode_for_storage(objects)) - - def find_remote_ref_types_in_swh(self, remote_refs): - """Parse the remote refs information and list the objects that exist in - Software Heritage. - """ - - all_objs = set(remote_refs.values()) - set(self._type_cache) - type_by_id = {} - - for id, objs in self.get_stored_objects(all_objs).items(): - id = hashutil.hash_to_bytehex(id) - if objs: - type_by_id[id] = objs[0]['type'] - - self._type_cache.update(type_by_id) - - ret = {} - for ref, id in remote_refs.items(): - ret[ref] = { - 'target': id, - 'target_type': self._type_cache.get(id), - } - return ret - - -class BulkUpdater(SWHStatelessLoader): - """A bulk loader for a git repository""" - CONFIG_BASE_FILENAME = 'loader/git-updater' - - ADDITIONAL_CONFIG = { - 'pack_size_bytes': ('int', 4 * 1024 * 1024 * 1024), - } - - def __init__(self, repo_representation=SWHRepoRepresentation, config=None): - """Initialize the bulk updater. - - Args: - repo_representation: swh's repository representation - which is in charge of filtering between known and remote - data. - - """ - super().__init__(logging_class='swh.loader.git.BulkLoader', - config=config) - self.repo_representation = repo_representation - - def fetch_pack_from_origin(self, origin_url, base_origin_id, - base_snapshot, do_activity): - """Fetch a pack from the origin""" - pack_buffer = BytesIO() - - base_repo = self.repo_representation( - storage=self.storage, - origin_id=base_origin_id, - base_snapshot=base_snapshot, - ignore_history=self.ignore_history, - ) - - client, path = dulwich.client.get_transport_and_path(origin_url, - thin_packs=False) - - size_limit = self.config['pack_size_bytes'] - - def do_pack(data, - pack_buffer=pack_buffer, - limit=size_limit, - origin_url=origin_url): - cur_size = pack_buffer.tell() - would_write = len(data) - if cur_size + would_write > limit: - raise IOError('Pack file too big for repository %s, ' - 'limit is %d bytes, current size is %d, ' - 'would write %d' % - (origin_url, limit, cur_size, would_write)) - - pack_buffer.write(data) - - remote_refs = client.fetch_pack(path, - base_repo.determine_wants, - base_repo.graph_walker(), - do_pack, - progress=do_activity) - - if remote_refs: - local_refs = base_repo.find_remote_ref_types_in_swh(remote_refs) - else: - local_refs = remote_refs = {} - - pack_buffer.flush() - pack_size = pack_buffer.tell() - pack_buffer.seek(0) - - return { - 'remote_refs': base_repo.filter_unwanted_refs(remote_refs), - 'local_refs': local_refs, - 'pack_buffer': pack_buffer, - 'pack_size': pack_size, - } - - def list_pack(self, pack_data, pack_size): - id_to_type = {} - type_to_ids = defaultdict(set) - - inflater = self.get_inflater() - - for obj in inflater: - type, id = obj.type_name, obj.id - id_to_type[id] = type - type_to_ids[type].add(id) - - return id_to_type, type_to_ids - - def prepare_origin_visit(self, origin_url, **kwargs): - 
-        self.visit_date = datetime.datetime.now(tz=datetime.timezone.utc)
-        self.origin = converters.origin_url_to_origin(origin_url)
-
-    def get_full_snapshot(self, origin_id):
-        prev_snapshot = self.storage.snapshot_get_latest(origin_id)
-        if prev_snapshot and prev_snapshot.pop('next_branch', None):
-            return snapshot_get_all_branches(self.storage,
-                                             prev_snapshot['id'])
-
-        return prev_snapshot
-
-    def prepare(self, origin_url, base_url=None, ignore_history=False):
-        base_origin_id = origin_id = self.origin_id
-
-        prev_snapshot = None
-
-        if not ignore_history:
-            prev_snapshot = self.get_full_snapshot(origin_id)
-
-        if base_url and not prev_snapshot:
-            base_origin = converters.origin_url_to_origin(base_url)
-            base_origin = self.storage.origin_get(base_origin)
-            if base_origin:
-                base_origin_id = base_origin['id']
-                prev_snapshot = self.get_full_snapshot(base_origin_id)
-
-        self.base_snapshot = prev_snapshot
-        self.base_origin_id = base_origin_id
-        self.ignore_history = ignore_history
-
-    def fetch_data(self):
-        def do_progress(msg):
-            sys.stderr.buffer.write(msg)
-            sys.stderr.flush()
-
-        fetch_info = self.fetch_pack_from_origin(
-            self.origin['url'], self.base_origin_id, self.base_snapshot,
-            do_progress)
-
-        self.pack_buffer = fetch_info['pack_buffer']
-        self.pack_size = fetch_info['pack_size']
-
-        self.remote_refs = fetch_info['remote_refs']
-        self.local_refs = fetch_info['local_refs']
-
-        origin_url = self.origin['url']
-
-        self.log.info('Listed %d refs for repo %s' % (
-            len(self.remote_refs), origin_url), extra={
-                'swh_type': 'git_repo_list_refs',
-                'swh_repo': origin_url,
-                'swh_num_refs': len(self.remote_refs),
-            })
-
-        # We want to load the repository, walk all the objects
-        id_to_type, type_to_ids = self.list_pack(self.pack_buffer,
-                                                 self.pack_size)
-
-        self.id_to_type = id_to_type
-        self.type_to_ids = type_to_ids
-
-    def save_data(self):
-        """Store a pack for archival"""
-
-        write_size = 8192
-        pack_dir = self.get_save_data_path()
-
-        pack_name = "%s.pack" % self.visit_date.isoformat()
-        refs_name = "%s.refs" % self.visit_date.isoformat()
-
-        with open(os.path.join(pack_dir, pack_name), 'xb') as f:
-            self.pack_buffer.seek(0)
-            while True:
-                r = self.pack_buffer.read(write_size)
-                if not r:
-                    break
-                f.write(r)
-
-        self.pack_buffer.seek(0)
-
-        with open(os.path.join(pack_dir, refs_name), 'xb') as f:
-            pickle.dump(self.remote_refs, f)
-
-    def get_inflater(self):
-        """Reset the pack buffer and get an object inflater from it"""
-        self.pack_buffer.seek(0)
-        return PackInflater.for_pack_data(
-            PackData.from_file(self.pack_buffer, self.pack_size))
-
-    def has_contents(self):
-        return bool(self.type_to_ids[b'blob'])
-
-    def get_content_ids(self):
-        """Get the content identifiers from the git repository"""
-        for raw_obj in self.get_inflater():
-            if raw_obj.type_name != b'blob':
-                continue
-
-            yield converters.dulwich_blob_to_content_id(raw_obj)
-
-    def get_contents(self):
-        """Format the blobs from the git repository as swh contents"""
-        max_content_size = self.config['content_size_limit']
-
-        missing_contents = set(self.storage.content_missing(
-            self.get_content_ids(), 'sha1_git'))
-
-        for raw_obj in self.get_inflater():
-            if raw_obj.type_name != b'blob':
-                continue
-
-            if raw_obj.sha().digest() not in missing_contents:
-                continue
-
-            yield converters.dulwich_blob_to_content(
-                raw_obj, log=self.log, max_content_size=max_content_size,
-                origin_id=self.origin_id)
-
-    def has_directories(self):
-        return bool(self.type_to_ids[b'tree'])
-
-    def get_directory_ids(self):
-        """Get the directory identifiers from the git repository"""
-        return (hashutil.hash_to_bytes(id.decode())
-                for id in self.type_to_ids[b'tree'])
-
-    def get_directories(self):
-        """Format the trees as swh directories"""
-        missing_dirs = set(self.storage.directory_missing(
-            sorted(self.get_directory_ids())))
-
-        for raw_obj in self.get_inflater():
-            if raw_obj.type_name != b'tree':
-                continue
-
-            if raw_obj.sha().digest() not in missing_dirs:
-                continue
-
-            yield converters.dulwich_tree_to_directory(raw_obj, log=self.log)
-
-    def has_revisions(self):
-        return bool(self.type_to_ids[b'commit'])
-
-    def get_revision_ids(self):
-        """Get the revision identifiers from the git repository"""
-        return (hashutil.hash_to_bytes(id.decode())
-                for id in self.type_to_ids[b'commit'])
-
-    def get_revisions(self):
-        """Format commits as swh revisions"""
-        missing_revs = set(self.storage.revision_missing(
-            sorted(self.get_revision_ids())))
-
-        for raw_obj in self.get_inflater():
-            if raw_obj.type_name != b'commit':
-                continue
-
-            if raw_obj.sha().digest() not in missing_revs:
-                continue
-
-            yield converters.dulwich_commit_to_revision(raw_obj, log=self.log)
-
-    def has_releases(self):
-        return bool(self.type_to_ids[b'tag'])
-
-    def get_release_ids(self):
-        """Get the release identifiers from the git repository"""
-        return (hashutil.hash_to_bytes(id.decode())
-                for id in self.type_to_ids[b'tag'])
-
-    def get_releases(self):
-        """Retrieve all the release objects from the git repository"""
-        missing_rels = set(self.storage.release_missing(
-            sorted(self.get_release_ids())))
-
-        for raw_obj in self.get_inflater():
-            if raw_obj.type_name != b'tag':
-                continue
-
-            if raw_obj.sha().digest() not in missing_rels:
-                continue
-
-            yield converters.dulwich_tag_to_release(raw_obj, log=self.log)
-
-    def get_snapshot(self):
-        branches = {}
-
-        for ref in self.remote_refs:
-            ret_ref = self.local_refs[ref].copy()
-            if not ret_ref['target_type']:
-                target_type = self.id_to_type[ret_ref['target']]
-                ret_ref['target_type'] = converters.DULWICH_TYPES[target_type]
-
-            ret_ref['target'] = hashutil.bytehex_to_hash(ret_ref['target'])
-
-            branches[ref] = ret_ref
-
-        self.snapshot = converters.branches_to_snapshot(branches)
-        return self.snapshot
-
-    def get_fetch_history_result(self):
-        return {
-            'contents': len(self.type_to_ids[b'blob']),
-            'directories': len(self.type_to_ids[b'tree']),
-            'revisions': len(self.type_to_ids[b'commit']),
-            'releases': len(self.type_to_ids[b'tag']),
-        }
-
-    def load_status(self):
-        """The load was eventful if the current snapshot is different to
-        the one we retrieved at the beginning of the run"""
-        eventful = False
-
-        if self.base_snapshot:
-            eventful = self.snapshot['id'] != self.base_snapshot['id']
-        else:
-            eventful = bool(self.snapshot['branches'])
-
-        return {'status': ('eventful' if eventful else 'uneventful')}
-
-
-if __name__ == '__main__':
-    import click
-
-    logging.basicConfig(
-        level=logging.DEBUG,
-        format='%(asctime)s %(process)d %(message)s'
-    )
-
-    @click.command()
-    @click.option('--origin-url', help='Origin url', required=True)
-    @click.option('--base-url', default=None, help='Optional Base url')
-    @click.option('--ignore-history/--no-ignore-history',
-                  help='Ignore the repository history', default=False)
-    def main(origin_url, base_url, ignore_history):
-        return BulkUpdater().load(
-            origin_url,
-            base_url=base_url,
-            ignore_history=ignore_history,
-        )
-
-    main()
diff --git a/swh/loader/git/utils.py b/swh/loader/git/utils.py
index 3b46f68..0a72bec 100644
--- a/swh/loader/git/utils.py
+++ b/swh/loader/git/utils.py
@@ -1,68 +1,75 @@
 # Copyright (C) 2017 The Software Heritage developers
 # See the AUTHORS file at the top-level directory of this distribution
 # License: GNU General Public License version 3, or any later version
 # See top-level LICENSE file for more information

 """Utilities helper functions"""

 import datetime
 import os
 import shutil
 import tempfile

-from subprocess import call
+from swh.core import tarball


 def init_git_repo_from_archive(project_name, archive_path,
                                root_temp_dir='/tmp'):
     """Given a path to an archive containing a git repository, uncompress
     that archive to a temporary location and return the path.

     If any problem whatsoever is raised, clean up the temporary location.

     Args:
         project_name (str): Project's name
         archive_path (str): Full path to the archive
         root_temp_dir (str): Optional temporary directory mount point
                              (default to /tmp)

     Returns:
         A tuple:
         - temporary folder: containing the mounted repository
         - repo_path, path to the mounted repository inside the temporary
           folder

     Raises:
         ValueError in case of failure to run the command to uncompress

     """
     temp_dir = tempfile.mkdtemp(
         suffix='.swh.loader.git', prefix='tmp.', dir=root_temp_dir)

     try:
         # create the repository that will be loaded with the dump
-        r = call(['unzip', '-q', '-o', archive_path, '-d', temp_dir])
-        if r != 0:
-            raise ValueError('Failed to uncompress archive %s' % archive_path)
-
+        tarball.uncompress(archive_path, temp_dir)
         repo_path = os.path.join(temp_dir, project_name)
+        # tarball content may not be as expected (e.g. no top level directory
+        # or a top level directory with a name different from project_name),
+        # so try to make it loadable anyway
+        if not os.path.exists(repo_path):
+            os.mkdir(repo_path)
+            for root, dirs, files in os.walk(temp_dir):
+                if '.git' in dirs:
+                    shutil.copytree(os.path.join(root, '.git'),
+                                    os.path.join(repo_path, '.git'))
+                    break
         return temp_dir, repo_path
     except Exception as e:
         shutil.rmtree(temp_dir)
         raise e
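+
+# Note on check_date_time() below: it returns nothing on success; the call
+# to datetime.datetime.fromtimestamp() is there only so that out-of-range
+# timestamps raise (e.g. OverflowError, depending on value and platform).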


 def check_date_time(timestamp):
     """Check date time for overflow errors.

     Args:
         timestamp (timestamp): Timestamp in seconds

     Raises:
         Any error raised by datetime.fromtimestamp conversion.

     """
     if not timestamp:
         return None
     datetime.datetime.fromtimestamp(timestamp, datetime.timezone.utc)
diff --git a/version.txt b/version.txt
index fb7c72a..b8c22ed 100644
--- a/version.txt
+++ b/version.txt
@@ -1 +1 @@
-v0.0.43-0-g7d4a965
\ No newline at end of file
+v0.0.48-0-gab5fa4b
\ No newline at end of file