
Documentation overhaul
ClosedPublic

Authored by seirl on May 17 2022, 1:54 AM.

Details

Summary

Large rework of the entire documentation of swh-graph.

  • New tutorial on how to use the Java API.
  • New page on disk/memory tradeoffs
  • Update the compression pipeline documentation with extensive details
  • Fix the compression steps diagram
  • Merge use-cases "blueprint" documentation in the HTTP API page as usage examples

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden.
docs/memory.rst
72–74

Anything where you can just scan the graph sequentially, like if you want to compute stats on indegrees / outdegrees, count nodes, etc.
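
For instance, something as simple as this sequential scan (a sketch, assuming graph is already loaded; numNodes() and outdegree() are the standard WebGraph accessors):

    // Average outdegree over the whole graph: one sequential pass, no extra memory.
    long numNodes = graph.numNodes();
    long totalOutdegree = 0;
    for (long node = 0; node < numNodes; node++) {
        totalOutdegree += graph.outdegree(node);
    }
    System.out.println("nodes: " + numNodes
            + ", average outdegree: " + (double) totalOutdegree / numNodes);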

docs/memory.rst
123–125

(Not here actually, fixing this.)

Build is green

Patch application report for D7839 (id=28333)

Rebasing onto 579f5a9e89...

Current branch diff-target is up to date.
Changes applied before test
commit 2141188fb34414bb58d42231b5cc5f0d16cb9d6a
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Tue May 17 01:49:41 2022 +0200

    Documentation overhaul
    
    Large rework of the entire documentation of swh-graph.
    
    - New tutorial on how to use the Java API.
    - New page on disk/memory tradeoffs
    - Update the compression pipeline documentation with extensive details
    - Fix the compression steps diagram
    - Merge use-cases "blueprint" documentation in the HTTP API page as
      usage examples

See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/189/ for more details.

Build is green

Patch application report for D7839 (id=28334)

Rebasing onto 579f5a9e89...

Current branch diff-target is up to date.
Changes applied before test
commit 7a1901476f0dc662629ff05878f39430efc6837a
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Tue May 17 01:49:41 2022 +0200

    Documentation overhaul
    
    Large rework of the entire documentation of swh-graph.
    
    - New tutorial on how to use the Java API.
    - New page on disk/memory tradeoffs
    - Update the compression pipeline documentation with extensive details
    - Fix the compression steps diagram
    - Merge use-cases "blueprint" documentation in the HTTP API page as
      usage examples

See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/190/ for more details.

Build is green

Patch application report for D7839 (id=28335)

Rebasing onto 579f5a9e89...

Current branch diff-target is up to date.
Changes applied before test
commit a0c974f5215e273ab04a29554456548f4b9b8316
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Tue May 17 01:49:41 2022 +0200

    Documentation overhaul
    
    Large rework of the entire documentation of swh-graph.
    
    - New tutorial on how to use the Java API.
    - New page on disk/memory tradeoffs
    - Update the compression pipeline documentation with extensive details
    - Fix the compression steps diagram
    - Merge use-cases "blueprint" documentation in the HTTP API page as
      usage examples

See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/191/ for more details.

Still reviewing the rest but went through the quickstart. I wanted to submit these comments before it got too late over there.

docs/quickstart.rst
17

Starting from a raw AWS Ubuntu instance, this needs python3-venv for the next step.

26–27

This command didn't work for me, "/usr/bin/python3: Relative module names not supported"

python3 -m venv workingDir did work. Then the next line needed to be 'source workingDir/bin/activate'

In addition, venv did something weird on my first try (it created an extra venv dir?), but on my second try, after running a full apt update; apt upgrade, it did not. I'd highlight that package upgrades need to be run first to ensure the commands work as written.

33

FYI this gave me an error "ERROR: Failed building wheel for blinker"

The next command worked though.

65

May be worth noting: this graph will NOT load on an AWS nano instance (lack of RAM). I'm not sure if a micro can run it or not, but I succeeded with a t3a.small instance. It might save some frustration for someone trying to do this on the AWS free tier if they knew the requirements in advance.

84
109

Two things: 1. I think the "from another terminal" addition is helpful. I would EXPECT people to understand that, but they might not notice the prompt below and think they can pass these commands directly into the graph terminal after they run rpc-serve.

And 2. Following these instructions was giving me 400 errors. I'm not sure why it isn't able to read the SWHID correctly?

http :5009/graph/leaves/swh:1:dir:432d1b21c1256f7408a07c577b6974bbdbcc1323
HTTP/1.1 400 Bad Request
Content-Length: 18
Content-Type: text/plain; charset=utf-8
Date: Tue, 17 May 2022 18:03:55 GMT
Server: Python/3.8 aiohttp/3.8.1

Unknown SWHID: swh

This is on a brand new Ubuntu t3a.small instance on AWS, after doing apt update; apt upgrade.

docs/api.rst
424

For at least one of the below, I'd give a full example command with full output. Like the httpie example in the quickstart. Only needs to be shown once but can help prompt someone who is speeding through.

Also, something somewhere needs to mention the weirdness/problem of the compressed graph wanting the swh:1:ori:HASH identifier. Or does the rpc-api convert the URIs given into proper IDs? What does it spit back out if you traverse to an origin?

I can test it on my system, but I don't know why it is not liking the SWHIDs given in the HTTP request.

docs/java-api.rst
32
38

Does this ever apply to SWH, anywhere? It is interesting to know but might confuse people who haven't yet read that everything in SWH is unidirectional.

41–52

I think this section needs to be lower down, or maybe in an advanced details page. It's an implementation detail that is useful to know when someone is down to the nuts and bolts, but not information they need to know or understand when doing more basic or early work. And when someone gets that far down, they probably want to know about more of the files than just these.

70

Shouldn't we encourage people to use the SWH wrappers with properties support, etc., instead of the ImmutableGraph directly? And above, the ImmutableGraph info is helpful when someone gets stuck or tries to push the limits as I have, but initially people mostly just want to know how to utilize the SWH wrappers you've built.

78–80

It's more work, but I'd suggest a simple "hello world" Java project quickstart for this. Barebones: load a graph, convert a SWHID to a nodeId, step forward arbitrarily, print the new nodeId or SWHID, exit. Include imports, plus example compilation and execution commands.

I say this as someone frequently tackling projects of varying sizes. The more of a foundational example I have to build from, the more likely I am to start things off right (or not get frustrated and switch to a different solution/approach/source entirely). I hate trying to wrangle with project setup, dependency hell, path and environment variables setup, and every language & project is different. So if there's a simple working "hello world java" thing, I'm going to do exactly that before I start writing my own stuff.

For example, I've hacked my rather complicated (at this point) project into the SWH utils directory, just because I hadn't worked with java-linux-CLI stuff like this before and wanted to get something going quickly (and it worked!). If I were to redo it, I'd wrap it into a separate project as you describe above.
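
Roughly what I have in mind (untested, and I'm guessing at the wrapper class/method names like SwhUnidirectionalGraph.loadMapped, getNodeId, getSWHID and successors, so treat it purely as a sketch, not the real API):

    import org.softwareheritage.graph.SWHID;
    import org.softwareheritage.graph.SwhUnidirectionalGraph;

    public class HelloGraph {
        public static void main(String[] args) throws Exception {
            // args[0]: path prefix of the compressed graph, e.g. "compressed/graph"
            // args[1]: a SWHID string, e.g. "swh:1:dir:432d1b21c1256f7408a07c577b6974bbdbcc1323"
            SwhUnidirectionalGraph graph = SwhUnidirectionalGraph.loadMapped(args[0]);
            long nodeId = graph.getNodeId(new SWHID(args[1]));   // SWHID -> internal node id
            long succ = graph.successors(nodeId).nextLong();     // step forward arbitrarily
            if (succ == -1) {
                System.out.println(args[1] + " has no successors");
            } else {
                System.out.println(args[1] + " -> " + graph.getSWHID(succ)); // node id -> SWHID
            }
        }
    }

Plus the javac/java commands with the swh-graph JAR on the classpath, so someone can copy-paste the whole thing end to end.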

111
112
189

I'll add a comment on the other page, but the performance & tuning page really needs to mention the DSI/fastutil classes and give some examples. HashSet<Long> won't scale to the big dataset for most use cases, but they won't realize that until they try to run it and fail.

I'll try to note some good examples when I get to that page.

238

This mention / introduction can be much higher. There are many reasons people would want to use SwhUnidirectionalGraph, and relatively few I can think of where they should use ImmutableGraph directly instead. It might be better to pretend ImmutableGraph doesn't exist until deeper in the page.

317–324

I'd also mention getUrl and that origin URLs are stored as a node property, not an edge property. The difference gets confusing when someone is trying to understand directory & revision labels.

352

I haven't tested it recently, but loadLabelledMapped didn't work for me a month or two ago, FYI. It crashed (exception / infinite loop) when trying to load it.

384

Helpful to just show converting it into what people will want to use.

393
394–396

More clear with real example files I think.

398–400
514–516

I think somewhere it would be good to mention or list some of the utilities that are in java/src/utils. They're both potentially useful for debugging and also good references for how to do some things. In particular, DumpProperties.java, FindEarliestRevision.java, and MPHTranslate.java.

docs/memory.rst
1

I suggest adding a header for Java types and performance suggestions.

  1. If someone is intending to traverse or iterate the big graph for anything meaningful, they almost certainly want to plan to do so multithreaded from the very beginning of their code.

Here's a sample approach I have adopted. I have no idea if it is good or bad or can be improved, but it does function. This code should iterate over all of the label names with NUM_THREADS threads, count the number of ".cpp" label names that are longer than 50 characters, return the longest filename found, and print both (not something I need to do, but a simple example with enough depth to be adapted to many uses, I think).

		// Imports this snippet needs (at the top of the enclosing file; the
		// org.softwareheritage package paths are my best guess and may differ by version):
		import java.io.IOException;
		import java.util.ArrayList;
		import java.util.concurrent.Callable;
		import java.util.concurrent.ExecutorService;
		import java.util.concurrent.Executors;
		import java.util.concurrent.Future;
		import org.softwareheritage.graph.SwhGraphProperties;
		import org.softwareheritage.graph.maps.NodeIdMap;

		// Result holder and worker: each worker scans its own stride of label ids.
		class longFileNameReturn { public String longestName; public long numLongNamesFound; }
		class longFileNameFinder implements Callable<longFileNameReturn> {
			public int workerId;
			public int numWorkers;
			public String graphPath;
			public longFileNameReturn call() throws IOException {
				SwhGraphProperties props = SwhGraphProperties.load(graphPath);
				props.loadContentLength();
				String longestFilename = "";
				long numLongFilesFound = 0;
				long numFilenames = getMPHSize(NodeIdMap.loadMph(graphPath + ".labels.mph"));
				// Stride over the label ids: this worker handles workerId, workerId + numWorkers, ...
				for (long currNameId = workerId; currNameId < numFilenames; currNameId += numWorkers) {
					String labelName = new String(props.getLabelName(currNameId));
					if (labelName.endsWith(".cpp") && labelName.length() > 50) {
						System.out.println("Found long filename " + labelName);
						numLongFilesFound++;
						if (labelName.length() > longestFilename.length()) {
							longestFilename = labelName;
						}
					}
				}
				var ret = new longFileNameReturn();
				ret.longestName = longestFilename;
				ret.numLongNamesFound = numLongFilesFound;
				return ret;
			}
		}
		// Driver: fan the work out over NUM_THREADS workers and collect their results.
		// (Future.get() throws InterruptedException/ExecutionException, so this part
		// needs to live in a method that handles or declares them.)
		ExecutorService executorService = Executors.newFixedThreadPool(NUM_THREADS);
		var results = new ArrayList<Future<longFileNameReturn>>();
		for(int i = 0;i<NUM_THREADS;i++) {
			var worker = new longFileNameFinder();
			worker.workerId = i;
			worker.numWorkers = NUM_THREADS;
			worker.graphPath = graphPath;
			results.add(executorService.submit(worker));
		}
		for(int i = 0;i<NUM_THREADS;i++) {
			var res = results.get(i).get();
			System.out.println("Thread "+i+" found "+res.numLongNamesFound+" long filenames and the longest was: "+res.longestName);
		}
		executorService.shutdown();
  2. The DSI libraries are your best friend. If you're tracking nodes you've already seen when traversing, a HashSet<Long> stores 16 bytes per node touched plus 8 bytes for every empty bucket in the HashSet. Traversing the whole full dataset would oversize the backing array and fail, but even if you used multiple arrays it would take ~328 gigabytes of RAM just to track seen nodes. If you instead make a dsi.bits.LongArrayBitVector, you only need 1 bit per node in the graph, or about 2.5 gigabytes of RAM, and a single-array version of that will keep working until the SWH dataset has ~137 billion nodes.

Similarly, LongArrayList saves at least 8 bytes per entry over ArrayList<Long>, and Long2LongOpenHashMap saves at least 16 bytes per entry over HashMap<Long, Long>.

  3. When traversing, always track and curtail repeated visits to the same nodes. The SWH graph provides an enormous number of paths to reach the same set of nodes. A simple single-origin traversal that appears to work fine on a randomly selected origin in the python3k dataset might take days on the full dataset if repeated node visits aren't short-circuited. (A rough sketch combining this and the previous point follows below.)

Those are the tips I have that I think would be good to add somewhere on this page.
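
For points 2 and 3 together, here's the rough shape I mean (a minimal sketch, untested as written; it assumes graph is already loaded and that successors() follows the WebGraph convention of a LazyLongIterator returning -1 when exhausted):

    import it.unimi.dsi.bits.LongArrayBitVector;
    import it.unimi.dsi.fastutil.longs.LongArrayFIFOQueue;

    // BFS from startNodeId, tracking visited nodes with 1 bit per node instead of HashSet<Long>.
    LongArrayBitVector visited = LongArrayBitVector.ofLength(graph.numNodes());
    LongArrayFIFOQueue queue = new LongArrayFIFOQueue();
    queue.enqueue(startNodeId);
    visited.set(startNodeId);
    long visitedCount = 0;
    while (!queue.isEmpty()) {
        long node = queue.dequeueLong();
        visitedCount++;
        var successors = graph.successors(node);
        for (long succ; (succ = successors.nextLong()) != -1;) {
            if (!visited.getBoolean(succ)) {   // short-circuit repeated visits
                visited.set(succ);
                queue.enqueue(succ);
            }
        }
    }
    System.out.println("Visited " + visitedCount + " nodes");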

18–27

Suggest adding -ea to enable assertions by default. I didn't realize Java disabled them by default until later. If they want them disabled, they will figure it out when they have a need to.

docs/quickstart.rst
109

FYI, this is what the other terminal showed as received. It all looks right; I'm not sure why it didn't work.

~/2021-03-23-popular-3k-python$ swh graph rpc-serve -g compressed/graph
INFO:root:using swh-graph JAR: /home/ubuntu/workingDir/share/swh-graph/swh-graph-0.5.2.jar
Loading graph compressed/graph ...
Graph loaded.
======== Running on http://0.0.0.0:5009 ========
(Press CTRL+C to quit)
INFO:aiohttp.access:127.0.0.1 [17/May/2022:18:02:10 +0000] "GET /graph/leaves/swh:1:dir:432d1b21c1256f7408a07c577b6974bbdbcc1323 HTTP/1.1" 400 178 "-" "HTTPie/1.0.3"
INFO:aiohttp.access:127.0.0.1 [17/May/2022:18:02:27 +0000] "GET /graph/leaves/swh:1:dir:432d1b21c1256f7408a07c577b6974bbdbcc1323 HTTP/1.1" 400 178 "-" "HTTPie/1.0.3"
INFO:aiohttp.access:127.0.0.1 [17/May/2022:18:02:38 +0000] "GET /graph/leaves/dir:432d1b21c1256f7408a07c577b6974bbdbcc1323 HTTP/1.1" 400 180 "-" "HTTPie/1.0.3"
INFO:aiohttp.access:127.0.0.1 [17/May/2022:18:03:11 +0000] "GET /graph/visit/nodes/swh:1:rel:0000000000000000000000000000000000000010 HTTP/1.1" 400 178 "-" "HTTPie/1.0.3"
INFO:aiohttp.access:127.0.0.1 [17/May/2022:18:03:55 +0000] "GET /graph/leaves/swh:1:dir:432d1b21c1256f7408a07c577b6974bbdbcc1323 HTTP/1.1" 400 178 "-" "HTTPie/1.0.3"
docs/java-api.rst
317–324

Also, something, somewhere, needs to document the whole swh:1:ori:LONGHASHSTRING issue: the website doesn't seem to support ANY ori:HASH lookups, whereas the graph ONLY supports ori:HASH lookups. A reverse mapping might be very helpful, and the ability to look up by ori:HASH on the website would be helpful, but at least documenting the difference is a must here.

Sorry if I mentioned this twice, can't recall if I mentioned it already.

docs/memory.rst
1

Oh! Oh! And make sure to mention the graph.copy() process for multithreading! My example doesn't show that, but it's super important that that's documented somewhere when multithreading or graph loading is mentioned.
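
Even a couple of lines would get the point across, e.g. (assuming the SWH wrappers expose WebGraph's usual copy()):

    // Each worker thread traverses its own lightweight copy of the graph;
    // a single graph instance must not be iterated from several threads at once.
    SwhUnidirectionalGraph myGraph = graph.copy();  // cheap: shares the underlying mmapped data
    var successors = myGraph.successors(nodeId);    // only use this copy from the current thread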

Monumental documentation work, thanks!
I think this is generally great, and I've pointed out only some minor issues/suggestions here and there.

docs/compression.rst
36–37

Minor: we are not sure yet about this, especially on NVMe drives. I'd generalize this by saying it could take "a lot (months)", or something such. We'll refine when we have actual numbers.

43

There is a terminology dichotomy in the doc: sometimes we call these "edge labels" and sometimes "edge properties", and I don't think it's consistent. Either we standardize on always using the same term (even if it's arguably imprecise), or we use the latter for the abstract concept and the former for the on-disk serialization format used by WebGraph.

54–57

Minor: this reads like an OVH advertisement :-) Just say, factually, that we rented the HGR-HCI-4 instance (with a link) from OVH for compression instead?

109

To better understand the output, from here we can point to the Java API documentation file, which contains a description of what each file does/means.

258–259

Expand this to a full citation with a link; here is the relevant info:

citation: Paolo Boldi, Marco Rosa, Massimo Santini, Sebastiano Vigna: Layered label propagation: a multiresolution coordinate-free ordering for compressing social networks. WWW 2011: 587-596
DOI: https://doi.org/10.1145/1963405.1963488
preprint: https://arxiv.org/abs/1011.5425

308–310

Same LLP citation as used above; probably to be factored out with a common link.

docs/java-api.rst
225–226

Given this info is split throughout this file (and referenced from outside), I'm wondering if it should be put in a separate, self-contained page ("compressed graph on-disk format", or something such) and referenced where it's needed. I suspect a lot of people will want to read/reference that, even if they don't need to understand the Java API.

seirl marked 21 inline comments as done. May 19 2022, 3:48 PM

Thanks for all the reviews, I made a first pass with the easiest fixes.

docs/java-api.rst
38

For any algorithm where you don't care about the direction, yes. For instance if you want to compute connected components, for LLP, or even for a BFS if you want.

41–52

I disagree; it's useful to know in advance because it tells you which files you need to download for your particular use case.

112

I don't understand this comment; the goal here is to introduce the "neighbor" terminology.

352

Fixed in https://github.com/vigna/webgraph-big/pull/5; we just need to wait for a new release with this change.

384

We don't want people to assume that these bytestrings have a defined encoding, though. They could be arbitrary bytes.

docs/quickstart.rst
33

Newer versions of pip just show that as a warning.

84

?

seirl marked an inline comment as done.
seirl added inline comments.
docs/api.rst
424

FYI, the ORC graph dataset now has an "id" column in the origin table, specifically to convert back from these sha1s to the URLs. It's now very similar to the other nodes, and it's already documented in the documentation of the dataset (which is the correct place to put this, imo)

seirl added inline comments.
docs/api.rst
424

Regarding the output, the above page has a ton of examples already. I just put this here to remove the outdated use-cases page, but it still feels a bit clumsy. Not sure what a better way to present this would be.

Build is green

Patch application report for D7839 (id=28395)

Rebasing onto 579f5a9e89...

Current branch diff-target is up to date.
Changes applied before test
commit 9ce4feb5b3ed5a76d0a27c0fc29e55621003813e
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Tue May 17 01:49:41 2022 +0200

    Documentation overhaul
    
    Large rework of the entire documentation of swh-graph.
    
    - New tutorial on how to use the Java API.
    - New page on disk/memory tradeoffs
    - Update the compression pipeline documentation with extensive details
    - Fix the compression steps diagram
    - Merge use-cases "blueprint" documentation in the HTTP API page as
      usage examples

See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/192/ for more details.

seirl added inline comments.
docs/compression.rst
43

I'm using "edge labels" everywhere now.

docs/java-api.rst
70

Again, it depends. The Swh*Graph wrappers are great, but they require other files. We want people to be able to only fetch the minimum they need for their analysis. I'm leaving that as-is but I'm writing a note that says most of the time they'll want to use SwhUnidirectionalGraph.

78–80

I think having an "examples" directory is a great idea. I'll do that in another diff.

225–226

For me, it's really important that the files are documented in this file, because we want to encourage people to fetch only the files they need for their own research goal. We could add another page, but the information is going to be even more spread out. I feel like the current version is a good compromise, but if someone has a better idea maybe we could do that in a subsequent diff.

238

Disagree for the reasons mentioned above, but I did leave a note earlier.

514–516

These should probably go in the examples dir! Out of the scope of this diff I think, but good idea regardless.

docs/java-api.rst
41–52

That's a good point. Then I guess more to Zack's point, putting the full list in one spot would help instead of having to scroll and learn piecewise. It's the kind of info people will reference but not learn.

112

I think that trying to use neighbors as terminology is going to confuse people, and adjacent is almost as bad. Neighbors don't have a direction; successors and predecessors ALWAYS do. The graph ALWAYS does. Even if a person is swapping between viewing it in reverse or forward to find connections to a node, they still have to do two separate operations to find the "before neighbors" and "after neighbors", so they still need to be aware of the distinction that everything is either a before neighbor or an after neighbor, at least at first.

Maybe "subsequent steps" isn't the best wording, but "neighbors" drops the directionality entirely. For me, learning the structure of this early on, I constantly had to remind myself that "adjacent" nodes always meant before or after (or, as above, it means "before plus after", but the distinction is still there).

225–226

I'd suggest moving it to a lower section and hyperlinking to it. You can list the relevant ones inline if you want, without bullet points, and people can find the info about them when they need it. As I mentioned, it's the kind of information they won't "learn" when reading, but they will want to reference when they get to that part of their project (I did and am!).

384

Are there any edges that have arbitrary bytes right now? No, right, they are all strings at least?

I think in that case, that's a very unlikely exception to a rule, kinda like having a directory named "/" or " ", or a file named "."

As an exception to the general rule, it would be worth mentioning it in a bullet point or casually, but I'd still suggest showing the normal-case handling.

If the structure changes enough to make "arbitrary bytes" a very common case, then update the documentation. I found "bytes" to be confusing personally when I knew I wanted strings and knew I was (?almost always?) dealing with strings.

docs/quickstart.rst
84

Oh, I guess that's the singular "its", not a plural possessive. Never mind.

seirl added inline comments.
docs/java-api.rst
112

I think "outgoing neighbors" should fix the issue.

384

All the edges are arbitrary bytes, actually. Git doesn't normalize to Unicode, and neither do we. These should really be handled as bytes, unless you're willing to lose data by ignoring encodings you can't decode.
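
E.g. a sketch of handling them as raw bytes end to end, decoding only for display (reusing props/labelId from the snippet earlier in this thread, and assuming getLabelName() returns the raw byte[]):

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    byte[] name = props.getLabelName(labelId);
    // Compare against a known suffix at the byte level, no decoding needed:
    byte[] suffix = ".cpp".getBytes(StandardCharsets.US_ASCII);
    boolean isCpp = name.length >= suffix.length
            && Arrays.equals(name, name.length - suffix.length, name.length,
                             suffix, 0, suffix.length);
    // Decode only when printing; undecodable sequences get replaced rather than dropped:
    System.out.println(new String(name, StandardCharsets.UTF_8));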

This revision was not accepted when it landed; it landed in state Needs Review. May 20 2022, 7:25 PM
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.

Build is green

Patch application report for D7839 (id=28443)

Rebasing onto 579f5a9e89...

Current branch diff-target is up to date.
Changes applied before test
commit aaed82fc23e9d0f28b24da7bb9c5ccd161a6abb3
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Tue May 17 01:49:41 2022 +0200

    Documentation overhaul
    
    Large rework of the entire documentation of swh-graph.
    
    - New tutorial on how to use the Java API.
    - New page on disk/memory tradeoffs
    - Update the compression pipeline documentation with extensive details
    - Fix the compression steps diagram
    - Merge use-cases "blueprint" documentation in the HTTP API page as
      usage examples

See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/193/ for more details.