
Request: Danbooru DB dumps

Posted under General

This is a request to albert: would it be possible to have DB dumps available for download? This has been touched on already in forum #16251 long ago. Basically, having an upstream-provided dump of the DB with only private info sanitised would be much, much cleaner and easier for everyone than scraping the API for the same info, especially since the API is incomplete in several regards and working around that would be an unnecessary waste of time.

Ideally (I don't know how feasible that is), there would be a baseline dump with incremental updates posted regularly, say once a week. If not that, having a regularly run automatic dump (with archived old versions available, please) would do. That would make it so much easier to run a mirror, experiment on your own with possible features, etc. Also, if you want to do that and want the image data, I have a mirror of everything up to about two weeks ago which can be made public. Use that instead of hammering danbooru with 210GB of unnecessary traffic (that's how much the images take at the moment).
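To make the baseline-plus-incrementals idea a bit more concrete: the baseline could be a plain pg_dump of the public tables, and the weekly incrementals could be as simple as copying out the rows touched since the previous run. A minimal sketch in Python with psycopg2, assuming the exported tables have an updated_at column (that's a guess at the schema on my part, as are the table names):

    # Export only the rows changed since the previous weekly run.
    # Assumes an updated_at column exists on the exported tables (a guess).
    import datetime

    import psycopg2

    def dump_changes_since(conn, table, since, outfile):
        # Table name and cutoff come from our own config, so plain string
        # formatting is fine for a sketch like this.
        sql = ("COPY (SELECT * FROM {0} WHERE updated_at >= '{1}') "
               "TO STDOUT WITH CSV HEADER").format(table, since.isoformat())
        with open(outfile, "w") as f, conn.cursor() as cur:
            cur.copy_expert(sql, f)

    if __name__ == "__main__":
        conn = psycopg2.connect("dbname=danbooru")
        last_run = datetime.datetime.now() - datetime.timedelta(days=7)
        for table in ("posts", "notes", "tag_aliases"):
            dump_changes_since(conn, table, last_run, table + ".delta.csv")

Keeping a few of these deltas next to the full dump would let mirrors stay current without re-downloading everything each week.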

And to kickstart the process, here's the checklist of things I believe will need to be sanitised before the dump is published:

  • Passwords (obviously)
  • Mailboxes
  • User email addresses
  • IPs

Things that are specifically of interest and should be retained if possible:

  • User accounts as such, including the level, posts, edits, etc. That is all interesting for the same reason invites are
  • Favourites
  • Comments
  • Invites (they're public data anyway, and provide very valuable insights on the network-of-trust which could be used to do interesting things regarding community management)
  • Bans and records
  • Notes, tags, artists, wiki, pools and their history (obviously)
  • Definitely all of tag aliases and implications. That info isn't reflected *at all* in the API, and the whole tag API is buggy, so it's particularly crucial.
  • Forum
  • Popularity stats, as well as the running updates for deletions and note changes, if that's tracked separately and not calculated from existing data

If there's any data not listed above, the default should be to include it. And if you disagree with any of the above points, please comment. The goal here is to work out the set of data that can be exported without objections, and then have it happen automatically.

Updated by EMUltra3

I'd second this request. While I've never done anything super substantial with the data, I have occasionally done full API scrapes of the posts and taglist for experimentation and exploring the data. Setting up something automatic would be preferable from both a utility and an efficiency standpoint.

I'd probably also add e-mail addresses as something to be sanitized though. They currently aren't public, and I'm sure many users would prefer they not be made so. Also, opening personal e-mails up to collection by potential spammers is a bad idea.

The database has gotten large enough that sanitizing all the sensitive information takes hours. It's not just passwords and dmails I'm uncomfortable with exposing, but I also consider IP addresses to be sensitive.

It's more feasible if you're asking for a subset of tables.

The tables containing the actual content (posts, tags, wiki, pools, etc) don't contain sensitive information, correct?

I'm not as interested in the user information, but if we wanted to produce that as well, would it be possible to set up something like a materialized view in Postgres for those tables sans sensitive information and export that instead of the original tables? Though I suppose generating that materialized view would take about as long as sanitizing the data in the first place...
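Roughly what I have in mind, as a sketch only: build export_* copies of the content tables with the sensitive columns simply left out, and re-run the job to "refresh" them. The column lists below are guesses at the actual schema:

    # Build sanitized export_* copies of the content tables; dropping and
    # re-creating them stands in for refreshing a materialized view.
    import psycopg2

    # table -> columns that look safe to publish (no IPs, no e-mail addresses)
    SAFE_COLUMNS = {
        "posts": "id, created_at, user_id, tags, rating, md5, source",
        "notes": "id, post_id, created_at, updated_at, x, y, width, height, body",
    }

    def refresh_export_tables(conn):
        with conn.cursor() as cur:
            for table, columns in SAFE_COLUMNS.items():
                export = "export_" + table
                cur.execute("DROP TABLE IF EXISTS " + export)
                cur.execute("CREATE TABLE %s AS SELECT %s FROM %s"
                            % (export, columns, table))
        conn.commit()

    if __name__ == "__main__":
        refresh_export_tables(psycopg2.connect("dbname=danbooru"))

pg_dump would then only need to see the export_* tables, though as I said the refresh itself still costs about as much as sanitizing would.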

Shinjidude said:
I'd probably also add e-mail addresses as something to be sanitized though.

albert said:
It's not just passwords and dmails I'm uncomfortable with exposing, but I also consider IP addresses to be sensitive.

Definitely yes, on both counts. I hadn't listed them because I didn't remember them being stored. Will update the OP list.

albert said:
It's more feasible if you're asking for a subset of tables.

What about a materialised view, as Shinjidude suggested? Alternatively, what about setting the sanitising job to a very low nice and letting it run overnight so that it doesn't sap the server? It doesn't really matter if the weekly snapshots are 12 or 24h out of date, as long as they're available.
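Something like the following is what I mean by a low-priority job (just a sketch; the table list and output path are made up, and ionice is Linux-specific):

    # Low-priority nightly dump, meant to be run from cron.
    import datetime
    import os
    import subprocess

    EXPORT_TABLES = ["posts", "tags", "tag_aliases", "tag_implications",
                     "notes", "pools", "wiki_pages", "forum_posts"]

    def nightly_dump(outdir="/var/www/dumps"):
        os.nice(19)  # drop this process's CPU priority as far as it goes
        stamp = datetime.date.today().isoformat()
        outfile = os.path.join(outdir, "danbooru-%s.dump" % stamp)
        cmd = ["ionice", "-c", "3",      # idle I/O scheduling class
               "pg_dump", "-Fc", "-f", outfile, "danbooru"]
        for table in EXPORT_TABLES:
            cmd += ["--table", table]
        subprocess.check_call(cmd)

    if __name__ == "__main__":
        nightly_dump()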

I'd really like to see it as complete as possible, because it's hard to explore the really interesting ideas otherwise, and user/community-related ones are some I wanted to look at the most.

Edit: I guess the blacklists would also be considered private information and not exported? Though, OTOH, we have the list of "uploaded tags" (i.e. the thing showing under My Tags when you edit a post) public, and that's really the same kind of information. And it could be really useful for a recommendation engine, which I'd like to see as a beefed-up replacement for the lost "people who favourited images you favourited include:" thing. So however we go about it, we should probably treat blacklisted and uploaded tags consistently.


Thirding this. The main thing I'm interested in is the favorites table. The post votes table would be nice too, if anonymizing the data isn't too much trouble. I was playing with recommendation algorithms a bit a while back, and having votes in addition to favorites would be extremely useful.
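For what it's worth, this is roughly the kind of thing I was playing with, just to show why the raw favorites data is enough to get started. It assumes the dump would come out as plain user_id,post_id pairs, which is a guess at the format; votes would slot in the same way as weights:

    # Toy item-item recommender over a hypothetical favorites dump
    # (a CSV of user_id,post_id pairs -- the layout is a guess).
    import csv
    from collections import defaultdict

    def load_favorites(path):
        fans = defaultdict(set)          # post_id -> users who favorited it
        with open(path, newline="") as f:
            for user_id, post_id in csv.reader(f):
                fans[post_id].add(user_id)
        return fans

    def similar_posts(fans, post_id, top_n=10):
        # Jaccard similarity between posts, based purely on shared favoriters.
        base = fans[post_id]
        scores = []
        for other, users in fans.items():
            overlap = len(base & users)
            if other != post_id and overlap:
                scores.append((overlap / len(base | users), other))
        return sorted(scores, reverse=True)[:top_n]

    if __name__ == "__main__":
        fans = load_favorites("favorites.csv")
        for score, post in similar_posts(fans, "1000"):
            print(post, round(score, 3))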

葉月 said:
Definitely all of tag aliases and implications. That info isn't reflected *at all* in the API, and the whole tag API is buggy, so it's particularly crucial.

I posted a dump of these a few days ago in forum #36244. Aliases and implications actually are accessible through the API. It's just undocumented, like many parts of the API.
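If anyone wants them before a proper dump shows up, something along these lines should do it. The /tag_alias/index.json and /tag_implication/index.json paths just mirror the controller/index.json pattern the rest of the API uses; since none of this is documented, treat the exact paths and parameters as an educated guess:

    # Page through the undocumented alias/implication listings.
    import json
    from urllib.request import urlopen

    BASE = "http://danbooru.donmai.us"

    def fetch_all(controller):
        page, rows = 1, []
        while True:
            url = "%s/%s/index.json?page=%d" % (BASE, controller, page)
            with urlopen(url) as resp:
                batch = json.load(resp)
            if not batch:        # an empty page means we've walked off the end
                return rows
            rows.extend(batch)
            page += 1

    if __name__ == "__main__":
        print(len(fetch_all("tag_alias")), len(fetch_all("tag_implication")))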

Shinjidude said:
The tables containing the actual content (posts, tags, wiki, pools, etc) don't contain sensitive information, correct?

Most tables contain the IP and user ID of the last updater.

葉月 said:
Edit: I guess the blacklists would also be considered private information and not exported?

Blacklists are already accessible through the API. Arguably this is a bug, but like you say, it's really no more revealing than any of the other user info that's already public. Last time I checked, the data wasn't very interesting anyway. Not many people used blacklists, and the ones who did mostly just blacklisted stuff like futanari, guro, scat, tentacles, yaoi, etc.

Bump again. Albert, any progress? I've seen no reply to the technical proposals to limit the performance impact of the dumping process, and I'd *really* like to avoid scraping everything through the API.

@ Hazuki

So do I. The more load-intensive features get identified and disabled, the more I'd like to re-implement them in-house, so that we can have them without straining the server.

@ Bapabooiee

The whole point of this thread is that while the code is open, the data isn't fully available. What is there currently requires abusing the API or otherwise scraping the system to retrieve. This is generally bad for the server if done for everything by multiple people, and especially so if those people do it periodically to keep things up to date.

It would be much more efficient to have a repository where everything could be gathered without using up much CPU time or DB capacity, and perhaps in such a way that bandwidth could be pooled.


Bapabooiee said:
Danbooru's open source, so someone could always grab a copy of it, get their own version running, populate the database with random (or not-so-random) data/posts, and then develop this feature themselves. Then maybe it can be fed upstream where it can be integrated into the Danbooru we all know and love.

I don't even know how to respond to that; it's so full of cluelessness.

Bapabooiee said:
Couldn't you just edit the OP the next time you need to bump the thread? That way, we can avoid adding extra replies to the thread.

No one thinks twice about a regular bump, but some people might think there's a site error if there are a lot of ghost bumps.
