Tiers ≫ Stars

#post #data #design #media #time
written February 5, 2023

In 2022 I recorded a brief review for each piece of media I consumed, along with a 10-point rating, like 7/10. It was a journaling exercise, mostly, but an unexpected result was significant reflection on 10-point systems and my misgivings with them. I had no good reason to distinguish a 1 from a 2, a 3 from a 4, or a 5 from a 6. It seemed to depend on the mood I was in when I recorded the rating, which infuriated me.

After the year concluded, I did a deep dive on rating systems in search of something preferable. I documented my findings and reflections in a separate post: Ratings Systems (2023-01-12).

In my own ratings, I've always been dissatisfied with stars, thumbs, and numbers. Each is either too coarse or too fine. I remember as a teenager grappling with iTunes and Windows Media Player, debating with myself what the difference was between a 1-star and a 2-star rating if they both meant "bad" to me, and discussing with friends their alternative philosophies. Our inability to align on their meanings indicated to me a failure of the system to adequately communicate.

During my exploration of ratings linked above, none of the systems I found stood out as something I would prefer to use myself. Inspired by the more ornate systems I encountered, I decided to take a bottom-up approach, working first on my data, then letting a visual representation emerge.

Analysis

I started with the issues that prompted my concerns – the inability to differentiate 1&2, 3&4, and 5&6 on the 10-point scale. Clearly those six numbers only mapped onto three distinct categories in my mind.

1, 2   →   1-2
3, 4   →   3-4
5, 6   →   5-6
7      →   7
8      →   8
9      →   9
10     →   10

This means I only have seven distinct perceptual categories, with 7/10 corresponding to the middle rank. It acts like a fulcrum, establishing polarity and parity between the ranks above and below it.

1-2   3-4   5-6   7   8   9   10
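
In code, the collapse looks something like this throwaway TypeScript sketch (my illustration here, not tooling from my notes):

    // Collapse a 1-10 score into one of the seven ranks I can
    // actually distinguish: 1&2, 3&4, and 5&6 fold into paired ranks.
    function collapseScore(score: number): string {
      if (!Number.isInteger(score) || score < 1 || score > 10) {
        throw new RangeError("score must be an integer from 1 to 10");
      }
      if (score <= 6) {
        const low = score % 2 === 0 ? score - 1 : score;
        return `${low}-${low + 1}`; // "1-2", "3-4", or "5-6"
      }
      return String(score); // 7 through 10 stay distinct
    }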

I believe every American has been programmed to equate 70% with a C grade and averageness. I was feeling the tug of that programming, and it was causing score inflation. It was also the first indication that I was ranking as if I were grading.

Detour

The American school system only has five grades (A, B, C, D, F), so at first I set aside the idea of using letters and explored alternatives.

Seven is a tricky baseline to work with for a rating system. The rating can't be used by itself, as in "6", because it doesn't communicate a clear maximum or, at worst, reads as though the maximum is ten. Representing a score as a number-out-of-seven, like 5/7, is used as a joke and reads like one, so that's also out.

I experimented with modifying 5-star ratings to have seven ranks. If I counted zero stars as a rank, that got me a bottom score, but the top rank was trickier. I tried special top ranks: a larger center star, different coloration, or just a 6th star. I also tried using a 6- or 7-star system throughout.

☆☆☆☆☆
★★★☆☆
★★★★★
★★★★★★

All felt forced and unsatisfying, and they shared two problems: the middle rank reads as above average in most variants (or below average in the 6-star case), and the ratings had to be represented with long, horizontal visuals. A smaller, scrunched cluster of stars would be too hard to visually parse without significant design work.

Providence

It took a long time for tier rankings to click. In retrospect it seems obvious. I think it wasn't on my radar until I started playing certain kinds of games where tier lists are common in their YouTube ecosystems. I hadn't thought to count the ranks before, but once I saw it I couldn't unsee it.

1-2   →   F
3-4   →   E
5-6   →   D
7     →   C
8     →   B
9     →   A
10    →   S

S-rank has been around for a long time and is apparently Japanese in origin. In English, its natural reading as "Special" makes it legible even to the uninitiated.

The real trick of tier list ranks is the introduction of E between D and F. From a letter-grading perspective, this is non-obvious. Historic grading systems have used E as a failing grade, but I'm not aware of a system that used both letters. Using both letters gives us an odd number of ranks, making the middle obvious, and letting C act as the natural fulcrum we expect it to be. That conformance to expectations is what makes these so intelligible.
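
This also makes porting my old 10-point reviews mechanical. A quick TypeScript sketch of the conversion (a hypothetical helper, just restating the table above):

    // The seven tiers, worst to best. E sitting between F and D is
    // the non-obvious move that gives the scale its odd length.
    type Tier = "F" | "E" | "D" | "C" | "B" | "A" | "S";

    const TIERS: Tier[] = ["F", "E", "D", "C", "B", "A", "S"];

    // 1-2 -> F, 3-4 -> E, 5-6 -> D, then 7 -> C, 8 -> B, 9 -> A, 10 -> S.
    function tierFromScore(score: number): Tier {
      if (!Number.isInteger(score) || score < 1 || score > 10) {
        throw new RangeError("score must be an integer from 1 to 10");
      }
      return score <= 6 ? TIERS[Math.ceil(score / 2) - 1] : TIERS[score - 4];
    }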

Interpretation

A friend pointed out that this all bore a striking resemblance to a 7-point Likert scale. I was doing a one-dimensional qualitative analysis with a neutral middle rank, strongly negative and positive extremes, and two gradations in between which I think of as analogous to the difference between North-Northwest and West-Northwest. I think he was spot on. But what was I measuring?

When you score things with numbers, you feel like you're assessing perceived quality. It didn't occur to me that I could be measuring something else entirely, but while sifting through my review notes I saw a different pattern. I was judging things by worthwhileness – my sense of how respectful they were of my time.

This realization resonated deeply. I have been vocally judgmental towards unscripted content, like streaming and most podcasts, for many years and for this exact reason. I feel they do not respect my time in the way that carefully edited content does.

That deep resonance is exactly what I was looking for. It allows me to unambiguously place things into these buckets by pure intuition and independent of mood.

S Tier

Doesn't waste a moment of my time, capturing all of my attention. If it slips even a little, it gets an A at best. The first consumption becomes a dedicated activity, not pairing well with multi-tasking.

A Tier

Worthwhile with only minor engagement issues, often matters of taste or creative differences. A five-tier system would fold this into the top rank.

B Tier

An honorable mention. Occasionally captivating, but it disappoints with severe, obvious stumbling blocks that feel more like mistakes and lost potential than differences of opinion.

C Tier

Background noise – at times preferable to silence, which is actually praiseworthy. It commands no attention, but when attention is given it does not offend. These qualities allow this media to occupy niches in our lives which others cannot.

D Tier

A few worthwhile aspects save this from being a complete waste of time. Generally, only fans of the media's genre should bother with these.

E Tier

This might be the worst tier to land in, because it actively wasted my time. If it had been more transparently awful up front, at least I could have reallocated that time.

F Tier

So much worse than sitting in silence and doing nothing that it's not worth finishing or giving it a "fair" shake.


Next Steps

For me, the core of this is a settled matter. I'd like to support it, perhaps by fleshing it out with my own tooling, similar to the sharing-friendly webapps that already exist.

Once a robust backlog of reviews has been ported over, I'd like to make a tier list view option for various pages on the site – that would probably be a good stepping stone to the aforementioned tooling.

The next frontiers are aggregation handling and preemptive assessments of unexperienced media.

Aggregation Handling

Aggregation has two big problems in this system:

  1. Because there are no fractional scores, you can't visually represent averages in the same tile format.
  2. It's unclear whether ranks should be weighted evenly. If a director made one S-tier movie and one F-tier movie, is it really fair to equate their average with that of someone who made infinite C-tier slop? I'd rather the S count for much more, but to what extent is not clear; one possibility is sketched below.
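
One direction I may experiment with for problem 2 is exponential weighting, where each tier is worth some multiple of the tier below it. A throwaway TypeScript sketch – the base of 3 is an arbitrary placeholder, not a settled choice:

    type Tier = "F" | "E" | "D" | "C" | "B" | "A" | "S";

    const TIER_ORDER: Tier[] = ["F", "E", "D", "C", "B", "A", "S"];

    // Each tier is worth BASE times the tier below it, so a single
    // S outweighs a long run of C-tier slop. BASE is the knob to tune.
    const BASE = 3;

    function tierValue(tier: Tier): number {
      return BASE ** TIER_ORDER.indexOf(tier);
    }

    // Average the exponential values, then snap back to the nearest tier.
    function aggregate(tiers: Tier[]): Tier {
      const mean = tiers.reduce((sum, t) => sum + tierValue(t), 0) / tiers.length;
      const index = Math.round(Math.log(mean) / Math.log(BASE));
      return TIER_ORDER[Math.min(index, TIER_ORDER.length - 1)];
    }

    // One S (3^6 = 729) and one F (3^0 = 1) average to 365, which snaps
    // to A – far above a filmography of pure C-tier (which stays C).
    console.log(aggregate(["S", "F"])); // "A"
    console.log(aggregate(["C", "C", "C"])); // "C"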

Preemptive Assessments

Assessing media that has not yet been experienced is very important to me. It has direct connections to the Problems of Memory and Curation, which I briefly explained in an earlier post: The Hard Problems (2020-06-20).

I prototyped additional categories representing a negative expectation, a neutral or unknown expectation, and a positive expectation, but I have since removed them because they were not worth the cognitive and architectural overhead. I will revisit this in the future.