A Scorecard for Reviewing
Motivation, design, and feedback on the reviewer scorecard program at IEEE InfoVis 2016 and 2017.
Note: This is a new import of an old blog post from 2017.
At IEEE InfoVis 2016, we created the first version of a “reviewer scorecard”: a personalized review summary for each person who participated in the InfoVis review process during that cycle. This year, we sent out the second version of the scorecards for IEEE InfoVis 2017 in early August. Here I want to discuss the background and the motivation for the reviewer scorecard as well as the feedback I’ve received on it.
Motivation
Serving as InfoVis papers co-chair in 2016 and 2017 has given me a lot of insight into the overall review process of the conference. All in all, each of these two years yielded close to 700 reviews from more than 250 individual reviewers, for a total of around 1,350 reviews across both years (I say “around” because not even papers chairs get to see all reviews due to conflicts of interest). I have at least skimmed the majority of these reviews (often just to check that they were appropriate), and I have read a reasonably large portion in full.
My overall impression is one of amazement: amazement at the high quality of these reviews, and amazement at the huge amount of time that reviewers invest in the voluntary and thankless job of reviewing. One of the reasons that IEEE InfoVis is seen as the premier conference in this field is no doubt the quality of its review process. I am pretty sure that most would-be InfoVis authors will, if nothing else, agree that the reviews they received were at least of high quality, regardless of whether their papers were accepted or not.
However, from my experience both as an author and as a papers chair, I also think that InfoVis reviews can be pretty harsh. I am sure most people will agree with this, too, regardless of whether their papers were accepted or not. Because the conference is the premier one in the field, people tend to set a very high bar of acceptance when reviewing for it. The fact that most reviewers are also authors with work under review at the conference contributes to this effect. If nothing else, this probably means that reviewers will (consciously or subconsciously) compare the work they are reviewing to their own, a comparison that may not be entirely unbiased. In the worst case, I suspect some reviewers may feel that rejecting other people’s work may give their own work a better chance. (To be perfectly clear, these two notions are not appropriate and should be avoided, but I still think they happen.)
As I have argued elsewhere, strict reviewing is fine as long as it is fair and equitable. After all, there can only be one top conference. High review quality will ensure that the correct papers are accepted, even if the number of accepted papers is low. One way to ensure high review quality is to provide feedback mechanisms where reviewers can calibrate themselves with other reviewers. InfoVis already does this on a tactical level by revealing the ratings and reviews for one paper to all reviewers of that paper. Seeing how other reviewers rate a submission based on its features and flaws is an important part of honing your reviewing skills. However, I wanted to go even further: how do you calibrate your reviewing against the entire reviewer pool for the conference?
Inspiration
Providing this kind of reviewer feedback on a large scale is not a new idea. I am not fully aware of all of the prior efforts (I am sure there are several), but the main inspiration for our reviewer scorecard came from a chart that was sent out to reviewers as part of the EuroVis conference (I actually don’t remember which year, but I am guessing 2011). This chart showed a scatterplot of the means and standard deviations of reviewer disposition (see below) for all reviewers, i.e. a measure of how “nice” versus “nasty” each reviewer was in their scoring. By showing this information for the entire reviewer pool and then informing each individual reviewer which dot in the plot they were, this chart helped reviewers see their ratings in context. I personally found it very useful for gaining an understanding of my own disposition as a reviewer (as it turns out, I tend to lean on the nice, or positive, side), and the chart has stuck with me ever since.
Experimentation in peer review is not uncommon. Another approach to improving calibration is the one that alt.chi used to employ, where anyone could freely review any submitted paper and reviewers were encouraged to identify themselves in their reviews. The latter practice (divulging reviewer names to the authors) is called “open reviewing” and has been employed at various other venues, including some machine learning ones (such as ICML and NIPS workshops). Our goal with this effort was to maintain the existing InfoVis peer review process, including the anonymity of the reviewers, and merely improve the calibration once the process had ended.
The InfoVis Review Process
In 2016 and 2017, InfoVis employed a two-round review process that was single-blind or double-blind at the authors’ discretion: reviewers remained anonymous, but authors were free to choose whether or not to anonymize their submission. Each submission was assigned to two program committee members (a primary and a secondary reviewer), each of whom invited an additional external reviewer. Each of these four reviewers wrote a full review of the submission. The primary also wrote a summary review that aggregated the collective feeling of the panel.
Individual reviews consisted of a score from 1 (lowest) to 5 (highest), including half points, an expertise rating from 1 (no or passing knowledge) to 3 (expert), and a free-text narrative. Primary reviewers further provided an overall score (same scale as the individual score) and a free-text review summary.
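For concreteness, here is a minimal sketch of how a single review under this scheme could be represented and validated. It is purely illustrative; the field names are my own and do not correspond to the actual PCS data model.

```python
# Illustrative only: a single review under the scheme described above.
# Field names are hypothetical, not the actual PCS fields.
from dataclasses import dataclass

@dataclass
class Review:
    score: float     # 1.0 (lowest) to 5.0 (highest), in half-point increments
    expertise: int   # 1 (no or passing knowledge) to 3 (expert)
    narrative: str   # free-text review

    def validate(self) -> None:
        if not (1.0 <= self.score <= 5.0) or not (float(self.score) * 2).is_integer():
            raise ValueError(f"score must be 1-5 in half points, got {self.score}")
        if self.expertise not in (1, 2, 3):
            raise ValueError(f"expertise must be 1, 2, or 3, got {self.expertise}")
```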
For more details about the InfoVis review process, please refer to our previous blog post on this topic from 2016. I recently also put together a comparative summary of the InfoVis 2016 and 2017 review processes with descriptive statistics from both years; see that article for details.
Scorecard Components and Metrics
How do you best characterize a reviewer’s collected reviews in the context of the entire pool while maintaining anonymity and confidentiality? In the first two iterations of the InfoVis reviewer scorecard, we employed the following metrics:
- Number of reviews: Many reviewers reviewed more than one submission during a single year. For this reason, all of the other metrics are calculated across all of the reviewer’s reviews for a particular year.
- Score: The scores the reviewer awarded to their assigned submissions (average, minimum, maximum, and standard deviation).
- Expertise: Their self-reported expertise rating.
- Review length: The number of characters in the reviewer’s free-text narratives.
- Score disposition: How far away was the reviewer’s score from the average score of each submission? Given that the average score for a submission is calculated by weighting together all four reviews for that submission, a positive disposition means that the reviewer tends to rate submissions higher than the average (i.e., the reviewer is “nice”), whereas a negative disposition means that the reviewer tends to rate submissions lower (i.e., the reviewer is “harsh”). The descriptive statistics across all of a reviewer’s assigned submissions give an understanding of the overall trend for that reviewer.
- Score distance: Similar to the disposition, except that the distance captures the absolute difference from the average. This eliminates the possibility of a misleading disposition metric for a reviewer who oscillates between being nice and harsh. A high distance basically means that the reviewer tends to be somewhat of an outlier or a “contrarian”, for better or worse, since the reviewer assigns scores that diverge from those of other reviewers. (A minimal computation sketch for these metrics follows this list.)
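To make these definitions concrete, here is a rough sketch of how the per-reviewer metrics could be computed with pandas. The column names are assumptions for illustration (not the actual PCS export columns), and the sketch uses a plain per-submission mean rather than the weighted average used in the real process.

```python
# Sketch of the per-reviewer metrics; column names are assumptions, and a plain
# mean stands in for the weighted per-submission average used in practice.
import pandas as pd

def reviewer_metrics(reviews: pd.DataFrame) -> pd.DataFrame:
    df = reviews.copy()
    # Average score of each submission across its (typically four) reviews.
    df["submission_mean"] = df.groupby("submission")["score"].transform("mean")
    # Signed disposition (positive = "nice") and absolute distance, per review.
    df["disposition"] = df["score"] - df["submission_mean"]
    df["distance"] = df["disposition"].abs()
    df["review_length"] = df["review_text"].str.len()
    # Aggregate across all of a reviewer's reviews for the year.
    return df.groupby("reviewer").agg(
        n_reviews=("score", "size"),
        score_mean=("score", "mean"),
        score_min=("score", "min"),
        score_max=("score", "max"),
        score_std=("score", "std"),
        expertise_mean=("expertise", "mean"),
        length_mean=("review_length", "mean"),
        disposition_mean=("disposition", "mean"),
        disposition_std=("disposition", "std"),
        distance_mean=("distance", "mean"),
    )
```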
We used these metrics to create a scorecard per reviewer that was then distributed to each individual. The scorecard reports the mean, standard deviation, minimum, and maximum of each metric for both the individual reviewer and the entire reviewer pool. Furthermore, the scorecard includes histograms for each metric so that the individual can see where they fall within the distribution for the whole reviewer pool. Finally, the scorecard includes a scatterplot of the mean and standard deviation of the score disposition, giving an overview of all reviewers.
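The two chart types could be produced along these lines with matplotlib; this is only a sketch of the idea (reusing the hypothetical metrics table from above), not the actual scorecard code or styling.

```python
# Sketch of the scorecard charts: a pool-wide histogram with the individual
# reviewer highlighted, and a disposition mean-vs-std scatterplot.
import matplotlib.pyplot as plt

def plot_scorecard_charts(metrics, reviewer):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

    # Histogram: where does this reviewer's mean score fall in the pool?
    ax1.hist(metrics["score_mean"], bins=20, color="lightgray")
    ax1.axvline(metrics.loc[reviewer, "score_mean"], color="crimson")
    ax1.set_xlabel("Mean score")
    ax1.set_ylabel("Number of reviewers")

    # Scatterplot: disposition mean vs. standard deviation for all reviewers.
    ax2.scatter(metrics["disposition_mean"], metrics["disposition_std"],
                color="lightgray", s=15)
    ax2.scatter(metrics.loc[reviewer, "disposition_mean"],
                metrics.loc[reviewer, "disposition_std"], color="crimson")
    ax2.set_xlabel("Disposition (mean)")
    ax2.set_ylabel("Disposition (std. dev.)")

    fig.tight_layout()
    return fig
```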
Examples of each chart can be found in my prior blog post on the InfoVis review process by the numbers. Here is an example of what the scorecards actually looked like; this one is for the “Anonymous” reviewer, i.e. the aggregate of all the reviewers I was in conflict with for InfoVis 2017.
Implementation
The current InfoVis scorecard is implemented as a Jupyter Notebook in Python that uses a number of the standard Python mathematics and statistics packages, such as pandas (for data wrangling), numpy (for numeric operations), and matplotlib (for generating charts). The script uses output that PCS (Precision Conference, the web-based conference review management system that InfoVis uses) can produce for papers chairs: a CSV spreadsheet containing all of the reviews, one per row, as well as a listing of the reviewer database. Of course, as with anything that PCS does, chair conflicts are preserved, which means that all of the papers that the downloading chair is conflicted with will have the reviewer listed as “Anonymous”, even though the full review text and score are available.
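Reading the export then amounts to something like the following (a sketch with a placeholder filename and column names; the actual PCS export format differs).

```python
# Sketch: load the per-review CSV export. Filename and column names are placeholders.
import pandas as pd

reviews = pd.read_csv("pcs_reviews_export.csv")

# Reviews of submissions the downloading chair is conflicted with arrive with the
# reviewer listed as "Anonymous"; they still contribute to the pool-wide statistics
# but end up grouped into a single "Anonymous" scorecard.
conflicted = reviews[reviews["reviewer"] == "Anonymous"]
```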
Running the script is as simple as opening the Jupyter Notebook and executing it. Intermediate results show summary charts of the overall scores, expertise, review length, and disposition/distance for the entire reviewer pool. The notebook also generates the scorecards as HTML files, one per reviewer, each given a unique UUID (Universally Unique Identifier) as a filename, and it spits out a CSV file mapping reviewer names and emails to report filenames. The reports are intended to be uploaded to a public web directory. The reports deliberately do not contain the name of the individual they are intended for, to minimize the risk of breaching confidentiality, and the web directory should have directory listings turned off so that people cannot simply browse the entire directory. A second script sends out an email to each of the reviewers named in the list of reports along with the unique URL to their personal report.
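The report-generation step could be sketched roughly as follows. `render_scorecard_html()` is a hypothetical placeholder for whatever actually renders the HTML, and the base URL is made up; the point is the UUID filenames and the private mapping file that only the mailing script sees.

```python
# Sketch of report generation: one HTML scorecard per reviewer with a random UUID
# filename, plus a private CSV mapping reviewers to their report URLs.
import csv
import uuid
from pathlib import Path

def write_reports(metrics, out_dir="reports", base_url="https://example.org/scorecards"):
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    rows = []
    for reviewer, row in metrics.iterrows():
        filename = f"{uuid.uuid4()}.html"                         # unguessable filename
        (out / filename).write_text(render_scorecard_html(row))   # hypothetical helper
        rows.append({"reviewer": reviewer, "url": f"{base_url}/{filename}"})
    # The mapping file is used only by the mailing script and stays off the web server.
    with open("report_index.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["reviewer", "url"])
        writer.writeheader()
        writer.writerows(rows)
```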
A Vision for Data-Driven Reviewing
Since the reviewer scorecard uses standard output from the PCS system, it should be possible to deploy this idea to other conferences that also use PCS with a minimum of adaptation. In 2016, the InfoVis scorecard was informally applied to data from the CHI 2017 conference, although no scorecards were sent out to reviewers. I am exploring the possibility of doing this for the upcoming CHI 2018 conference, hopefully this time with the scorecards actually being sent out.
In general, our experiences from deploying the InfoVis review scorecard clearly show that review data can (and probably should) be used to improve a conference. The feedback we’ve received from reviewers has so far been exclusively positive. A lot of people have emailed me directly and lauded our efforts at improving review quality. A few have told me that they were surprised to see their own numbers in context with the rest of the reviewer pool and are looking to make a change. Some people involved with organizing other conferences (such as CHI) have reached out to see if the scorecard can be deployed in their settings.
Of course, another question is whether the scorecard has actually had a measurable effect on review quality. Two years certainly does not provide enough data to say whether this is true, so more time and more data are necessary. The small changes between 2016 and 2017 (average review length grew, for example), reported in my article on the InfoVis review process by the numbers, could very well be within the margin of error rather than systematic improvements. In other words, answering this question will take more time than just our first two years.
Having said that, our use of the data has so far been restricted to helping reviewers self-calibrate and improve their own reviewing practice, but this entirely ignores another possibility: letting editors and program committee members use review data to select reviewers, and potentially to identify “bad” reviewers. In other words, data on reviewer quality and performance over time would serve as useful input when a papers chair, an associate editor, or a program committee member is deciding which reviewers to invite (or, in fact, when appointing the editors, PC members, or chairs themselves). This is a little more controversial (for example, people’s reviewing abilities clearly improve over time, and my experience is that the more senior a reviewer gets, the more mild-mannered, balanced, and inclusive their reviews become), but it would make for a more systematic approach to promoting good reviewers and pruning bad ones. Papers chairs and editors already maintain this type of knowledge, but in a more informal way, as part of the institutional memory of the venue.
Conclusion
I’ve presented a high-level overview of the reviewer scorecard that I implemented for the InfoVis 2016 and 2017 conferences. The goal of this article is to give a little more insight into these metrics, their rationale, and the feedback I’ve received on them, beyond the motivation given in the scorecards themselves. I have also outlined a future vision for how data-driven reviewing can help reviewers themselves, as well as make the identification of good versus bad reviewers more effective and grounded in actual reviewing data.
Originally published at https://sites.umiacs.umd.edu.