Danbooru

Meaning of the related tags

Posted under Bugs & Features

What exactly does the related tags field for API endpoints such as tags.json?search[name] signify?
Since the values are often between 0 and 1, I assumed they were percentages. Then I searched for vanilla and got values above 1, and now I am confused.

Related tags are used to suggest similar tags during tag editing. For example, if you're editing a cirno post and you click the cirno tag in the tag edit box, then you click the "Related tags" button, you'll get a list of other tags related to cirno: ice wings, daiyousei, blue dress, touhou, and so on. They're also listed in the sidebar when you do a cirno search.

Relatedness of tags is measured by cosine similarity. The formula is this:

similarity(tag1, tag2) = {{tag1 tag2}} / sqrt({{tag1}} * {{tag2}})

where:

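{{tag1}} = the size of tag1, meaning the number of posts tagged tag1 (likewise for {{tag2}})
{{tag1 tag2}} = the size of a {{tag1 tag2}} search, meaning the number of posts tagged with both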

So for example, the similarity between cirno and ice wings is calculated like this:

{{cirno}} = 25338
{{ice_wings}} = 6474
{{cirno ice_wings}} = 6399

similarity(cirno, ice_wings) = 6399 / sqrt(25338 * 6474)
similarity(cirno, ice_wings) = 0.4996
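
For anyone who wants to check the numbers themselves, here is a minimal Python sketch of the formula above. The function name tag_similarity and its parameter names are made up for illustration and are not part of Danbooru's code or API; the post counts are the ones quoted in this example.

```python
import math


def tag_similarity(count_a: int, count_b: int, count_both: int) -> float:
    """Cosine similarity between two tags, computed from post counts.

    count_a    -- number of posts tagged with the first tag
    count_b    -- number of posts tagged with the second tag
    count_both -- number of posts tagged with both tags
    """
    if count_a <= 0 or count_b <= 0:
        # An empty tag shares no posts with anything.
        return 0.0
    return count_both / math.sqrt(count_a * count_b)


# Post counts quoted in the example above.
cirno = 25338
ice_wings = 6474
cirno_and_ice_wings = 6399

print(round(tag_similarity(cirno, ice_wings, cirno_and_ice_wings), 4))  # 0.4996
```

Since the number of shared posts can never exceed the size of the smaller tag, this value always stays between 0 and 1.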

The intuitive way to look at this is that two tags are highly related if they have many posts in common and they're nearly the same size. To put it another way, it's a measure of how close the two tags are to being synonyms, meaning they describe the exact same set of posts.

So for example, serval (kemono friends) is very highly related to serval ears because nearly all Serval posts are tagged serval ears, and nearly all serval ears posts are tagged Serval. Serval (kemono friends) and serval ears are virtually the same tag.

Ice wings is less related to Cirno because while nearly all ice wings posts are tagged Cirno, not all Cirno posts are tagged ice wings. Blue hair is less related than ice wings because, while most Cirno posts are tagged blue hair, few blue hair posts are tagged Cirno.

The exact values in the related tags field may differ from this formula because of certain optimizations. For small tags, the values are just the number of posts the tags have in common (that is, {{tag1 tag2}} instead of {{tag1 tag2}} / sqrt({{tag1}} * {{tag2}})), which is how values above 1 can appear. For very large tags, we only include posts from the past year or so ({{tag1 tag2 age:<1year}} instead of {{tag1 tag2}}).
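
A rough Python sketch of the small-tag case may help show where values above 1 come from. Everything named here is an assumption made for illustration: the related_score function, the SMALL_TAG_LIMIT cutoff of 1000 posts, and the choice to test the size of the tag being looked up are all hypothetical, since the real cutoff and logic are not stated in this thread. The past-year restriction for very large tags is not modelled, because it only changes which posts are counted.

```python
import math

# Hypothetical cutoff; the real threshold is not given in this thread.
SMALL_TAG_LIMIT = 1000


def related_score(count_query: int, count_other: int, count_both: int,
                  small_tag_limit: int = SMALL_TAG_LIMIT) -> float:
    """Approximate the related tags value for one pair of tags.

    count_query -- posts tagged with the tag being looked up
    count_other -- posts tagged with the candidate related tag
    count_both  -- posts tagged with both
    """
    if count_query < small_tag_limit:
        # Small tag (assumed rule): report the raw overlap count,
        # which can be far larger than 1.
        return float(count_both)
    if count_other <= 0:
        return 0.0
    # Otherwise fall back to the cosine similarity formula.
    return count_both / math.sqrt(count_query * count_other)


# Made-up counts for a small tag: the value is the overlap itself.
print(related_score(50, 400, 12))  # 12.0

# The cirno / ice_wings counts from earlier in the thread.
print(round(related_score(25338, 6474, 6399), 4))  # 0.4996
```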

Thank you for the detailed answer. Now it makes sense to me.

I made a mistake there by thinking that the vanilla-tag refers to something other than actual vanilla.
That certainly explains the low number of posts when searching for vanilla though.
