Danbooru

Filesize inconsistencies between pixiv and other sites

Posted under General

post #1487853 Seiga
post #1487170 Pixiv

post #1490587 Seiga
post #1490593 Pixiv

Between the Seiga and Pixiv versions, there are no changes. They are exactly the same image, except in one respect: filesize. I had recently been wondering why the filesizes of Hammer's works from pixiv and seiga were different, so I finally decided to test it.

I created 3 different images, uploaded them to seiga and pixiv, and compared the filesizes of the copies on my hard drive, on seiga and on pixiv. For all three images, the hard drive and seiga files were exactly the same size, but the pixiv file was smaller.

Another example:
http://seiga.nicovideo.jp/seiga/im3367161
http://nijie.info/view.php?id=58303
http://www.pixiv.net/member_illust.php?mode=medium&illust_id=37980257

All three were uploaded within minutes of each other and there are no revisions between them, but the pixiv version has a slightly reduced filesize.

I have no idea if this problem happens 100% of the time, but I'm sure this only started recently since I've run into dupe errors when uploading from nijie/seiga in the past.

This is, of course, extremely annoying since it can lead, and has led, to duplicates. I don't think many people here will actually care and there's really not a whole lot we can do about it, but as with the twitpic situation I posted about before (topic #8811), I thought I'd at least make a post exposing the issue.

It looks like Pixiv is running uploaded JPEG images through the jpegtran optimizer, in hopes of saving space by stripping metadata. Here's a comparison with one of the images Ars mentioned:

$ jpegtran -optimize nico_seiga.jpg > nico_seiga_optimized.jpg
$ md5 *.jpg
MD5 (nico_seiga.jpg) = 1726097181f52f2b70e87dbfe5071f71
MD5 (nico_seiga_optimized.jpg) = 5abd16719080ff8b3003c067abb65697
MD5 (pixiv.jpg) = 5abd16719080ff8b3003c067abb65697

Technically speaking, this can change the appearance of the image (for example, EXIF colorspace information is stripped in this optimization pass), but the DCT-encoded image content is left alone.
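
The comparison above can also be scripted. What follows is only a minimal Python sketch, not anything Pixiv or Danbooru is known to run. It assumes a `jpegtran` binary is on the PATH, and the helper names (`md5_hex`, `jpegtran_optimize`, `matches_after_optimize`) are made up for illustration.

```python
import hashlib
import subprocess


def md5_hex(data: bytes) -> str:
    """Return the hex MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


def jpegtran_optimize(jpeg_bytes: bytes) -> bytes:
    """Pipe JPEG data through `jpegtran -optimize` (stdin to stdout).

    Assumes jpegtran is installed; raises CalledProcessError on failure.
    """
    result = subprocess.run(
        ["jpegtran", "-optimize"],
        input=jpeg_bytes,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout


def matches_after_optimize(original: bytes, candidate: bytes,
                           optimize=jpegtran_optimize) -> bool:
    """True if optimizing `original` yields the same MD5 as `candidate`."""
    return md5_hex(optimize(original)) == md5_hex(candidate)
```

Passing the bytes of nico_seiga.jpg as `original` and the bytes of pixiv.jpg as `candidate` would reproduce the check in the transcript above, provided the local jpegtran build behaves like the one used there. The `optimize` parameter is only there so a different transform can be swapped in.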

S1eth said:

I disagree with your wiki edit. The unaltered nico image should be the parent.
It's not like a duplicate is uploaded a few months later. With that artist, you know that there's a nico version.

For what reason? o_o
According to uxw the pictures are identical save some metadata - and it's discouraged to upload identical pictures.

I don't follow nico yet. I thought about it but it would eat too much additional time. That's not the problem here, though.

Since the pictures are identical save the metadata, there shouldn't be a second upload at all, and if there still is one, it definitely should become the child post. The best solution would be for IQDB to recognize the prior upload, but until it does (if ever) we have to find an interim arrangement.

I agree with S1eth. The more "authentic" image should always be the parent in my opinion, over an image that has been affected by a third party in any way (same thing with an image uploaded by the artist vs. higher-resolution scans).

Not saying it's a big deal, which is why I put "authentic" in quotes to begin with. It just makes more sense to have it as the parent. Which image was uploaded first really should be the last consideration in determining which post is a parent and which is a child.

RaisingK said:

As major a source as Pixiv is, would it be worth having Danbooru automatically check the jpegtran output when a user tries to upload something?

It would help if and only if the pixiv image was already uploaded.
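
To make that asymmetry concrete, here is a rough Python sketch of such an upload check. It is not Danbooru's actual code: `find_existing_post`, the `posts_by_md5` lookup table, and the post IDs are hypothetical, and `fake_optimize` is a stand-in that strips a made-up "META" prefix so the example runs without jpegtran. A real check would call `jpegtran -optimize` instead.

```python
import hashlib
from typing import Callable, Dict, Optional


def md5_hex(data: bytes) -> str:
    """Return the hex MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


def find_existing_post(upload: bytes,
                       posts_by_md5: Dict[str, int],
                       optimize: Callable[[bytes], bytes]) -> Optional[int]:
    """Look up the upload by its raw MD5, then by its optimized MD5."""
    raw_md5 = md5_hex(upload)
    if raw_md5 in posts_by_md5:
        return posts_by_md5[raw_md5]
    return posts_by_md5.get(md5_hex(optimize(upload)))


def fake_optimize(data: bytes) -> bytes:
    """Stand-in for jpegtran: drops a pretend metadata prefix."""
    return data[4:] if data.startswith(b"META") else data


seiga_file = b"META" + b"pixels"          # unaltered file with metadata
pixiv_file = fake_optimize(seiga_file)    # what pixiv would serve

# Case 1: the pixiv version was posted first (hypothetical post ID 101).
pixiv_first = {md5_hex(pixiv_file): 101}
caught = find_existing_post(seiga_file, pixiv_first, fake_optimize)

# Case 2: the seiga version was posted first (hypothetical post ID 102).
seiga_first = {md5_hex(seiga_file): 102}
missed = find_existing_post(pixiv_file, seiga_first, fake_optimize)
```

In case 1 the seiga upload is caught, because optimizing it reproduces the stored pixiv hash. In case 2 the pixiv upload slips through, because the stripped metadata cannot be restored to recover the stored seiga hash.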

Schrobby said:

Those double uploads are starting to get ridiculous. Something has to be done.

For now, I recommend you (and all other uploaders) use the find similar function before uploading each image. It only takes about a second.
