Artist Info Check Script / Artist DB Backup

This is a "small" Perl script that makes it slightly easier and quicker to gather artist info from certain sites (Twitter/Seiga/Pixiv/Tinami).
It also cleans up certain links into a more consistent format, and removes links that give no way of pointing back to the artist (NicoNico/Tinami image URLs, for example).
Filenames are also stripped from certain links if they end in jpg/png/gif, so that the find artist function works properly.
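
To give a rough idea of the filename stripping, here is a minimal sketch in Python (not the script's actual Perl). The function name `strip_image_filename` is made up for illustration, and dropping the query string along with the filename is my assumption, not something the script is confirmed to do.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

# The three image types named in the post.
IMAGE_EXTENSIONS = {".jpg", ".png", ".gif"}


def strip_image_filename(url):
    """Drop a trailing image filename so the URL points at its directory.

    Hypothetical helper: URLs that don't end in an image file are
    returned unchanged.
    """
    parts = urlsplit(url)
    directory, filename = posixpath.split(parts.path)
    extension = posixpath.splitext(filename)[1].lower()
    if extension not in IMAGE_EXTENSIONS:
        return url  # not a direct image link, leave it alone
    # Keep the directory with a trailing slash; the query and fragment
    # are dropped too (an assumption for this sketch).
    new_path = directory.rstrip("/") + "/"
    return urlunsplit((parts.scheme, parts.netloc, new_path, "", ""))
```

So something like `http://example.com/gallery/img/001.jpg` would be reduced to `http://example.com/gallery/img/`, while a profile URL with no image extension passes through untouched.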

I mainly wrote this to do a mass-backup/cleanup of the artist database, but I guess this could be useful for quickly getting info for new artists etc.
It's not exactly the most user-friendly script in the world, but it does what it needs to do.

This basically works by grabbing the links via the API and sending them through a URL cleaner/blacklist.
It then screen-scrapes certain URLs (Pixiv/Tinami/Seiga/Twitter, plus Twitpic/yFrog/Twitgoo/Twipple/img.ly) for any whitelisted links, and adds them to the list of URLs.
If a Twitter link exists, Twitpic/yFrog/Twitgoo/Twipple/img.ly are also checked for images, and a link is added for each one that has any. (Can't check actual Twitter images since Twitter doesn't provide a way to.)
Names are also added if they differ from the artist tag and aren't already in OtherNames.
Example: image
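
The name handling could look roughly like the sketch below (Python, not the script's real code). The function name `new_other_names` is hypothetical, and comparing names case-insensitively is my assumption.

```python
def new_other_names(artist_tag, other_names, found_names):
    """Return scraped names that are neither the tag nor already listed.

    Hypothetical helper: comparison ignores case and surrounding
    whitespace (an assumption for this sketch).
    """
    known = {artist_tag.casefold()} | {name.casefold() for name in other_names}
    additions = []
    for name in found_names:
        cleaned = name.strip()
        key = cleaned.casefold()
        if key and key not in known:
            additions.append(cleaned)
            known.add(key)  # don't add the same name twice
    return additions
```

Only names that actually add something end up in the result; the tag itself and anything already in OtherNames are skipped.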

The URLs that get cleaned are mainly Twitter/Seiga URLs, as well as a few others.
For instance, http://seiga.nicovideo.jp/user/illust/18617301 would become http://seiga.nicovideo.jp/user/illust/18617301?target=illust_all, and so on.
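
As a sketch of that Seiga rewrite (in Python, with a made-up function name and a regex I wrote for illustration, so the real script's pattern may well differ):

```python
import re

# Matches a Seiga user illust URL and captures the numeric user id.
SEIGA_USER_ILLUST = re.compile(
    r"^https?://seiga\.nicovideo\.jp/user/illust/(\d+)"
)


def clean_seiga_url(url):
    """Rewrite a Seiga user illust URL to the ?target=illust_all form.

    Hypothetical helper: anything that doesn't match is returned as-is.
    """
    match = SEIGA_USER_ILLUST.match(url)
    if match is None:
        return url
    user_id = match.group(1)
    return f"http://seiga.nicovideo.jp/user/illust/{user_id}?target=illust_all"
```

Running the cleaner on an already cleaned URL gives the same URL back, so it is safe to apply it more than once.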

How to use:

  • 0.1: Install Perl if you don't already have it (see below).
  • 0.2: Open download.pl in Notepad or similar and fill in the settings with the required info.
  • 0.3: Run the batch script to install the modules (if not already installed).
  • 1: Open the folder containing the .pl file in CMD/Terminal.
  • 2: Type "download.pl ARTISTNAME" (so "download.pl caidychen", for example). It should check through everything and ask you to confirm the update.
  • 3: Simply type Y then Enter to update. (Anything else stops the update.)

Download: Link | Source (Updated 06/08)

If you're not familiar with Perl, you may need to install it (if you are on Windows). If you're on Linux/Mac it "should" already be pre-installed.
It also requires a few extra modules which need to be installed via CPAN. I've included two batch scripts (one for Windows, another for Linux/Mac), which should install all the required modules.

Feel free to suggest if you have any ideas :)

Updated by RaisingK

I've fixed the issue mentioned above, as well as a few other minor things.
Mainly: dead URLs are now logged to a txt file (and no longer break the script when it hits one), the check for whether a URL has a filename is better, and some URLs that weren't being cleaned now are.
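
The dead URL handling can be pictured roughly like this Python sketch (not the actual Perl). The function name `check_urls`, the default log filename and the tab-separated log format are all assumptions. `fetch` is any callable that raises `OSError` on failure, for example `urllib.request.urlopen`.

```python
def check_urls(urls, fetch, log_path="dead_urls.txt"):
    """Return the URLs that could be fetched; log the rest to a text file.

    Hypothetical helper: `fetch` is called once per URL and is expected
    to raise OSError (which urllib's URLError and HTTPError subclass)
    when the URL is dead.
    """
    live = []
    with open(log_path, "a", encoding="utf-8") as log:
        for url in urls:
            try:
                fetch(url)
            except OSError as error:
                # Record the failure and carry on instead of aborting the run.
                log.write(f"{url}\t{error}\n")
                continue
            live.append(url)
    return live
```

The point is just that one dead link gets written down and skipped, rather than killing a run that might cover thousands of artists.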

The main reason I was grabbing the extra Pixiv URL was that Pixiv is slowly changing how their system works.
They've made multiple (but minor) changes to the API lately, as well as the obvious change to the new Pixiv URL format.
Considering it gets regex'd anyway, I don't see why the old format should be kept over the new one.

On another note..
Over the past few days I've managed to run the script through every single artist in the DB (~75K), mainly because it's much faster to do it this way, and it lets me fix broken links en masse.

There are around 337K links in total. I know for a fact there are several duplicates in there, as well as a few broken links (due to regex being a pain at grabbing links from text blocks).
I want to avoid doing a mass artist update mainly for this reason.

I kept an SQL backup of the entire thing as well, since it's much easier to use that to check for duplicates and to check multiple blogs at once.
So if anyone would find any use for it, here is an SQL backup of the entire thing.
Here's also a TXT version of that, which is a bit easier to look through if you don't have access to an SQL server.
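
For anyone wondering what a duplicate check against the backup might look like, here is a minimal sketch using Python's built-in `sqlite3` as a stand-in for a real SQL server. The table name `artist_urls` and its `artist`/`url` columns are hypothetical, since the layout of the actual dump isn't described here, so adjust them to whatever the backup really uses.

```python
import sqlite3


def find_duplicate_urls(connection):
    """Return (url, count) rows for every URL stored more than once.

    Assumes a hypothetical table artist_urls(artist, url).
    """
    query = """
        SELECT url, COUNT(*) AS uses
        FROM artist_urls
        GROUP BY url
        HAVING COUNT(*) > 1
        ORDER BY uses DESC, url
    """
    return connection.execute(query).fetchall()


def load_sample(rows):
    """Build an in-memory database from (artist, url) pairs for testing."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE artist_urls (artist TEXT, url TEXT)")
    connection.executemany("INSERT INTO artist_urls VALUES (?, ?)", rows)
    return connection
```

The same `GROUP BY ... HAVING COUNT(*) > 1` query should carry over to most SQL servers with little or no change.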