Analytical Cellular Pathology 35 (2012) 97–100 DOI 10.3233/ACP-2011-0029 IOS Press


Mitotic figure recognition: Agreement among pathologists and computerized detector

Christopher Malon (a,∗), Elena Brachtel (b), Eric Cosatto (a), Hans Peter Graf (a), Atsushi Kurata (c), Masahiko Kuroda (c), John S. Meyer (d), Akira Saito (e), Shulin Wu (b) and Yukako Yagi (b)
(a) Department of Machine Learning, NEC Laboratories America, NJ, USA
(b) Department of Pathology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
(c) Department of Molecular Pathology, Tokyo Medical University, Tokyo, Japan
(d) Department of Pathology, St. Luke’s Hospital (St. Louis), Chesterfield, MO, USA
(e) Innovative Service Solutions Division, NEC Corporation, Tokyo, Japan

Abstract. Despite the prognostic importance of mitotic count as one of the components of the Bloom-Richardson grade [3], several studies [2, 9, 10] have found that pathologists’ agreement on the mitotic grade is fairly modest. Collecting a set of more than 4,200 candidate mitotic figures, we evaluate pathologists’ agreement on individual figures, and train a computerized system for mitosis detection, comparing its performance to the classifications of three pathologists. The system’s and the pathologists’ classifications are based on evaluation of digital micrographs of hematoxylin and eosin stained breast tissue. On figures where the majority of pathologists agree on a classification, we compare the performance of the trained system to that of the individual pathologists. We find that the level of agreement of the pathologists ranges from slight to moderate, with strong biases, and that the system performs competitively in rating the ground truth set. This study is a step towards automatic mitosis counting to accelerate a pathologist’s work and improve reproducibility.

1. Introduction
Beginning with Greenough’s 1925 grading system [7], pathologists have attempted to quantify factors that provide a measure of an invasive breast tumor’s locality and prognosis. From the seven factors in Greenough’s system, Bloom and Richardson settled on three factors [3], and the 1991 Nottingham revisions gave more stable deﬁnitions to these variables [6]. The grades are widely used to inform the selection of high-risk treatments, through the information they provide about survival or the likelihood of distant metastasis.
∗Corresponding author: Christopher Malon, PhD, Department of Machine Learning, NEC Laboratories America, 4 Independence Way, Princeton, NJ 08540, USA. Tel.: +1 609 951 2594; Fax: +1 609 951 2482; E-mail: malon@nec-labs.com.

In light of the critical role of grading, many authors have investigated agreement between pathologists on individual components of the grade [2, 9, 10]. The level of agreement may be reported using Cohen’s Kappa statistic [5], which equals one in the case of perfect agreement and zero in the case of probabilistically independent decisions. The range 0–0.2 is often considered as slight agreement, 0.2–0.4 as fair, 0.4–0.6 as moderate, 0.6–0.8 as good, and 0.8–1 as almost perfect.
Meyer’s study of agreement on Bloom-Richardson grading [9] involved groups of five to seven pathologists, with each group examining 10–23 patients’ slides of hematoxylin and eosin stained biopsy tissue through an analog microscope. Agreement on the overall grade was moderate, with κ = 0.50–0.59 for the various groups. The tubularity grade achieved stronger agreement than the other components (κ = 0.57–0.83).
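For reference, Cohen’s Kappa for two observers making binary decisions can be written as below. The symbols p_o, p_e, p_1 and p_2 are notation introduced here for illustration and do not come from the cited studies.

```latex
\kappa = \frac{p_o - p_e}{1 - p_e},
\qquad
p_e = p_1 p_2 + (1 - p_1)(1 - p_2)
```

Here p_o is the observed fraction of items on which the two observers agree, p_1 and p_2 are the fractions of items that each observer labels positive, and p_e is the agreement expected if the two observers decided independently. Perfect agreement gives p_o = 1 and hence κ = 1, while independent decisions give p_o = p_e in expectation and hence κ = 0.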

2210-7177/12/$27.50 © 2012 – IOS Press and the authors. All rights reserved


Agreement on the pleomorphism component was weakest (κ = 0.27–0.50). The range of agreement for the mitotic grade was κ = 0.45–0.67.
The NEC e-Pathologist Project aims to provide diagnosis support for anatomical pathology, by providing computerized image analysis for virtual microscopy. It has released modules for analysis of gastric tissue that are in use in a major commercial laboratory in Japan. A breast module for hematoxylin and eosin stained tissue is under development, to target the two least stable parts of the Bloom-Richardson grade. For pleomorphism, the system aims to produce output representing carefully measured statistics of nuclear shape. For mitosis, the system aims to provide a classifier with consistent judgments.
Usage of a computerized mitosis detector could be the continuation of a line of efforts to devise more stable, prognostically useful guidelines for mitosis grading. The original Bloom-Richardson system directed that hyperchromatic nuclei be counted along with true mitotic ﬁgures. Cutoffs between grades were vaguely prescribed. In the Nottingham revision, pathologists were directed instead to avoid hyperchromatic nuclei, apoptotic ﬁgures, and pyknosis. Mitotic ﬁgures in prophase were no longer to be counted because agreement was low. Figures were to be counted in ten high power ﬁelds (25× or 40× magniﬁcation), and grading cutoffs were made to depend on the high power ﬁeld size. Each high power ﬁeld was to be taken from the tumor’s periphery.
The rigor of the grade given in clinical practice appears to vary widely. Comparing pathologists who formed a quick 30-second impression of ten high power fields to those who spent 2–3 minutes applying the rules for the WHO Mitotic Activity Index (MAI), Skaland [11] found that those who followed the MAI procedure gave grades whose prognostic value was stronger by two orders of magnitude in p-value than the grades of those who formed the quick impression. Some see the need for even more concrete guidelines. Baak [1, 12] attempts to describe figures to be counted in image analysis terms (loss of nuclear membrane, presence of “clear, hairy extensions of nuclear material,” etc.). One of the authors (AK) believes a complete decision tree of such image analysis rules could be drawn. However, another author (EB) believes that it may be impossible to apply even such specific guidelines by viewing just one focal plane, because some details of mitotic figures are recognizable only when focusing up and down with the microscope.

Whereas most previous investigations of agreement of mitosis have focused on the mitotic grade, we examine agreement on individual ﬁgures. In the largest previous such study we found [9], seven pathologists examined 43 potential mitotic ﬁgures. Average pairwise agreement was κ = 0.38.
2. Methods
In the present study, we asked three of the authors (AK, JM, and SW), who are pathologists actively signing out breast cases, to examine 4,204 potential mitotic figures. These figures were taken from 2,444 high power fields in 94 breast slides, stained with hematoxylin and eosin. Tissues were provided by Massachusetts General Hospital and Tokyo Medical University, and scanned on Hamamatsu Nanozoomer scanners. At full magnification (40×) the scanners afforded a resolution of 4.39 pixels per micron. Because a random selection of nuclei would include too many obvious non-mitotic figures, one author selected the candidates manually, intending to obtain figures that were mitotic or worth a closer look.
Each pathologist examined all 4,204 ﬁgures, answering the question “Is this ﬁgure mitosis or not?” with “Yes,” “No,” or “Maybe”. For each pair of pathologists, we considered the ﬁgures where both pathologists committed to a “Yes” or a “No,” and computed Cohen’s Kappa on this subset. In this way, ﬁgures where the digital image was inadequate for decision should be excluded.
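The pairwise filtering described above can be sketched as follows. This is a minimal illustration, not the authors’ code: the function name `committed_table` and the toy label lists are hypothetical, and the study’s actual data format is not given in the text.

```python
from collections import Counter

COMMITTED = ("Yes", "No")


def committed_table(labels_first, labels_second):
    """Cross-tabulate two observers' labels, keeping only the figures
    where both observers committed to "Yes" or "No"."""
    counts = Counter()
    for first, second in zip(labels_first, labels_second):
        if first in COMMITTED and second in COMMITTED:
            counts[(first, second)] += 1
    return counts


# Hypothetical toy labels for six figures (not data from the study).
observer_1 = ["Yes", "Yes", "No", "Maybe", "Yes", "No"]
observer_2 = ["Yes", "No", "No", "Yes", "Maybe", "Yes"]

table = committed_table(observer_1, observer_2)
n_committed = sum(table.values())
```

In this toy example the two figures that received a “Maybe” from either observer drop out, leaving four figures from which a 2×2 agreement table (and then Cohen’s Kappa) can be computed.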
The second part of our investigation concerned the performance of a computerized detector. As discussed above, the description of mitotic ﬁgures in terms of decisions about shape and structure may be complex and unclear. Translating those verbal directives into manually coded image analysis rules could be dangerous. Avoiding such a heuristic approach, we developed a detector based on machine learning.
In our system, a simple rule that extracts blobs representing nuclei of possible mitotic figures (and many other nuclei, to be rejected later) establishes a set of candidates from the digital micrograph of the high power field (HPF). Machine learning is applied in three phases. The first phase applies a support vector regression [13], which remaps the color palette of the HPF to normalized values. The next phase is a convolutional neural network [8], applied at each extracted blob. The convolutional neural network contributes a feature to a feature vector, which also contains many other measurements regarding the shape, color, mass, and texture of the blob and its neighborhood. In the final phase, a support vector machine [13] uses the feature vector to classify the area around the blob as a mitotic figure or not.
Machine learning algorithms learn classiﬁcation rules by taking a set of training data where features can be measured and true classiﬁcations for each example are known. Using the training data, they ﬁnd parameters to be used for classiﬁcation by solving a minimization problem. The learned parameters can then be applied to new examples, where the features can be measured but the classiﬁcation is not given, to predict the correct classiﬁcation.
Applying machine learning thus requires that we divide our data into a set for training and a set for testing. We reserved 799 mitotic ﬁgures for testing and used the remainder for developing our algorithm. Figures for testing and ﬁgures for training never came from the same slides. This careful separation ensures that our test ﬁgures reﬂect tissues totally unseen during development.
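A slide-level split of this kind can be sketched as below. The text does not say how slides were assigned to each set or what fraction was held out, so the random assignment, the `test_fraction` parameter, and the `(slide_id, figure_id)` representation are all assumptions made for illustration.

```python
import random


def split_by_slide(figures, test_fraction=0.2, seed=0):
    """Split (slide_id, figure_id) pairs so that no slide contributes
    figures to both the training set and the test set."""
    slides = sorted({slide_id for slide_id, _ in figures})
    rng = random.Random(seed)
    rng.shuffle(slides)
    n_test = max(1, round(len(slides) * test_fraction))
    test_slides = set(slides[:n_test])
    train = [fig for fig in figures if fig[0] not in test_slides]
    test = [fig for fig in figures if fig[0] in test_slides]
    return train, test


# Hypothetical data: 10 slides with 5 candidate figures each.
figures = [(f"slide{s}", f"fig{s}_{i}") for s in range(10) for i in range(5)]
train, test = split_by_slide(figures)

train_slides = {slide_id for slide_id, _ in train}
test_slides = {slide_id for slide_id, _ in test}
```

Splitting on slide identifiers rather than on individual figures is what keeps tissue seen during development out of the test set.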
Training the machine also requires that we assign one ground truth classification to each figure in the training set, but we have three classifications, one from each pathologist. We aggregated the pathologists’ decisions using majority voting. Namely, figures with two or three “Yes” labels were taken as ground truth positive, and figures with two or three “No” labels were taken as ground truth negative. Figures for which there were neither two “Yes” verdicts nor two “No” verdicts, such as a set of labels “Yes,” “Maybe,” and “Maybe,” were excluded from the training set. The same procedure was used on the testing data set.
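The majority-voting rule above can be written compactly as follows. The function name `majority_label` and the return values are hypothetical choices made for this sketch.

```python
def majority_label(votes):
    """Return "positive" or "negative" when at least two of the three
    votes agree on "Yes" or "No"; return None when the figure has no
    such majority and is to be excluded."""
    if votes.count("Yes") >= 2:
        return "positive"
    if votes.count("No") >= 2:
        return "negative"
    return None


examples = {
    ("Yes", "Yes", "No"): majority_label(["Yes", "Yes", "No"]),
    ("No", "Maybe", "No"): majority_label(["No", "Maybe", "No"]),
    ("Yes", "Maybe", "Maybe"): majority_label(["Yes", "Maybe", "Maybe"]),
    ("Yes", "No", "Maybe"): majority_label(["Yes", "No", "Maybe"]),
}
```

With three votes per figure, the two conditions cannot both hold, so the order of the checks does not matter.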
We ﬁnd that this protocol is useful for investigation purposes because it allows a fair comparison of the machine to pathologists. If the only ﬁgures in the data set were those for which all three pathologists agreed, there would be no fair baseline for the machine’s performance (trivially, each participating pathologist would perfectly predict all the ground truth labels).
3. Results
Table 1 shows pairwise agreement of our three pathologists, in cases where both committed to “Yes” or “No” decisions. Statistics of agreement are computed in Table 2.

Table 1 Pathologist agreement of mitotic figures

Pathologists   Yes/Yes   No/Yes   Yes/No   No/No
A/B               1352        4      789     102
A/C               2705       20      172      83
B/C               1506      756       15     461

Table 2 Statistics of pathologist agreement of mitotic figures

Pathologists   Cohen’s κ   Prevalence    Bias
A/B                 0.13         0.56   −0.35
A/C                 0.44         0.88   −0.05
B/C                 0.39         0.38    0.27

The prevalence index and the bias index express attributes of agreement that are not apparent from Cohen’s Kappa alone [4]. The prevalence index describes the relative frequency of agreed positives versus agreed negatives. The bias index compares the frequency of positive/negative disagreements and negative/positive disagreements.
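As a check on how these statistics relate to the counts, the sketch below recomputes Cohen’s Kappa, the prevalence index, and the bias index from the Table 1 counts. The sign convention for the bias index (No/Yes minus Yes/No, divided by the total) and the reading of “No/Yes” as “first observer No, second observer Yes” are inferred from the published values rather than stated in the text, and the function name `agreement_stats` is hypothetical.

```python
def agreement_stats(yes_yes, no_yes, yes_no, no_no):
    """Return (kappa, prevalence index, bias index) for a 2x2 table.
    A label such as no_yes means: first observer No, second observer Yes."""
    n = yes_yes + no_yes + yes_no + no_no
    observed = (yes_yes + no_no) / n
    first_yes = (yes_yes + yes_no) / n
    second_yes = (yes_yes + no_yes) / n
    # Agreement expected if the two observers decided independently.
    expected = first_yes * second_yes + (1 - first_yes) * (1 - second_yes)
    kappa = (observed - expected) / (1 - expected)
    prevalence = (yes_yes - no_no) / n
    bias = (no_yes - yes_no) / n  # sign convention inferred from Table 2
    return kappa, prevalence, bias


# Counts from Table 1, in the order Yes/Yes, No/Yes, Yes/No, No/No.
table_1 = {
    "A/B": (1352, 4, 789, 102),
    "A/C": (2705, 20, 172, 83),
    "B/C": (1506, 756, 15, 461),
}
stats = {
    pair: tuple(round(value, 2) for value in agreement_stats(*counts))
    for pair, counts in table_1.items()
}
```

Rounded to two decimals, these values match the entries of Table 2 for all three pairs.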
We ﬁnd that strong biases exist between pairs of observers. B is much more likely to reject a ﬁgure called mitotic by C or A than to count a ﬁgure rejected by C or A. A measurable but much less signiﬁcant bias exists between C and A: C is more likely to reject a ﬁgure counted by A than A is likely to reject a ﬁgure counted by C.
In Table 3, we examine how each pathologist’s vote predicted the ground truth labels in the test set, which were determined by the result of majority voting. This calculation is intended as a baseline for the binary classiﬁer trained by machine learning, which decides “Yes” or “No” for every candidate mitotic ﬁgure. The pathologists had the additional option of saying “Maybe” to a ﬁgure, so their binary classiﬁcation performance may be considered as a range, bounded by the performance levels if all “Maybe” decisions were changed to either “Yes” or “No”.
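The lower and upper bounds described above follow directly from the Yes/Maybe/No counts. The sketch below, with a hypothetical function name `agreement_bounds`, treats every “Maybe” as a disagreement for the lower bound and as an agreement for the upper bound.

```python
def agreement_bounds(yes, maybe, no, majority_positive):
    """Return (lower, upper) agreement with the majority label.
    Lower bound: every "Maybe" counts against the observer.
    Upper bound: every "Maybe" counts in the observer's favor."""
    total = yes + maybe + no
    matching = yes if majority_positive else no
    return matching / total, (matching + maybe) / total


# Observer B on the 726 majority-positive test figures (counts from Table 3).
lower_b, upper_b = agreement_bounds(394, 166, 166, majority_positive=True)

# The machine never answers "Maybe", so its two bounds coincide.
lower_m, upper_m = agreement_bounds(462, 0, 264, majority_positive=True)
```

For observer B this gives a range of about 54.3% to 77.1% on positives, and for the machine a single value of about 63.6%.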
Observer C’s performance at first looks surprisingly high, but all observers were advantaged over the machine by contributing a vote to the “majority label.” Recall from Table 1 that A and B disagreed on 35% of the cases where they both committed to “Yes” or “No”. In each of these cases, C’s vote actually defined the majority label. As the tie-breaker in this protocol, C is naturally expected to have high agreement with the majority label.


Table 3 Prediction of test figures where two pathologists agreed

Observer   Majority label   Observer Yes/Maybe/No   Lower bound agreement   Upper bound agreement
A          Positive (726)   658/65/3                 90.6%                   99.6%
A          Negative (73)    15/36/22                 30.1%                   79.4%
B          Positive (726)   394/166/166              54.3%                   77.1%
B          Negative (73)    0/0/73                  100.0%                  100.0%
C          Positive (726)   720/4/2                  99.2%                   99.7%
C          Negative (73)    2/2/69                   94.5%                   97.3%
Machine    Positive (726)   462/0/264                63.6%                   63.6%
Machine    Negative (73)    1/0/72                   98.6%                   98.6%

The last line of Table 3 shows the machine’s performance. Compared to A, the machine performs much more strongly on negatives, but not as well on positives. Compared to B, the machine misclassiﬁes only one more negative ﬁgure, while falling within the bounds of B’s performance on positive ﬁgures.
We ﬁnd this result encouraging, considering several disadvantages of the machine. One disadvantage is that the machine was conﬁned to looking at a small box around each mitotic ﬁgure, whereas the pathologist could examine the entire HPF. Another is the lack of a special strategy for telophase ﬁgures. Each blob in a telophase ﬁgure is regarded separately (although the second blob is visible). Features considering the relationship between the two nearby blobs should improve the performance further.
Although the range of Cohen’s Kappa for pathologists on individual mitotic ﬁgure recognition is perhaps not surprising, given its range on grade-level agreement, we were surprised to ﬁnd strong biases. The biases suggest that different pathologists interpret grading guidelines differently. We hope that the development of a computerized mitotic detector may be one step towards the establishment of more stable tissue grading.
References
[1] J. Baak, Mitosis counting in tumors, Human Pathology 21(7) (1990), 683–685.
[2] J.P.A. Baak, E. Gudlaugsson, I. Skaland, L.H.R. Guo, J. Klos, T.H. Lende, H. Soiland, E.A.M. Janssen and A. zur Hausen, Proliferation is the strongest prognosticator in node-negative breast cancer: Signiﬁcance, error sources, alternatives, and comparison with molecular prognostic markers, Breast Cancer Res Treat 115 (2009), 241–254.

[3] H.J. Bloom and W.W. Richardson, Histological grading and prognosis in breast cancer; a study of 1409 cases of which 359 have been followed for 15 years, Br J Cancer 11 (1957), 359–377.
[4] T. Byrt, J. Bishop and J. B. Carlin, Bias, prevalence and kappa, Journal of Clinical Epidemiology 46(5) (1993), 423–429.
[5] J. Cohen, A coefﬁcient of agreement for nominal scales, Educational and Psychological Measurement 20(1) (1960), 37–46.
[6] C.W. Elston and I.O. Ellis, Pathological prognostic factors in breast cancer. I. The value of histological grade in breast cancer: Experience from a large study with long-term follow-up, Histopathology 19(5) (1991), 403–410.
[7] R.B. Greenough, Varying degrees of malignancy in cancer of the breast, J Cancer Res 9 (1925), 425–463.
[8] Y. Le Cun, L. Bottou, Y. Bengio and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86(11) (1998), 2278–2324.
[9] J. S. Meyer, C. Alvarez, C. Milikowski, N. Olson, I. Russo, J. Russo, A. Glass, B.A. Zehnbauer, K. Lister and R. Parwaresch, Breast carcinoma malignancy grading by Bloom-Richardson system vs proliferation index: Reproducibility of grade and advantages of proliferation index, Modern Pathology 18 (2005), 1067–1078.
[10] P. Robbins, S. Pinder, N. de Klerk, H. Dawkins, J. Harvey, G. Sterrett, I. Ellis and C. Elston, Histological grading of breast carcinomas: A study of interobserver agreement, Hum Pathol 26(8) (1995), 873–879.
[11] I. Skaland, P.J. van Diest, E.A.M. Janssen, E. Gudlaugsson and J.P.A. Baak, Prognostic differences of World Health Organization-assessed mitotic activity index and mitotic impression by quick scanning in invasive ductal breast cancer patients younger than 55 years, Hum Pathol 39(4) (2008), 584–590.
[12] P. van Diest, J. Baak, P. Matze-Cok, E. Wisse-Brekelmans, C. van Galen, P. Kurver, S. Bellot, J. Fijnheer, L. van Gorp, W. Kwee, J. Los, J. Peterse, H. Ruitenberg, R. Schapers, M. Schipper, J. Somsen, A. Willig and A. Ariens, Reproducibility of mitosis counting in 2,469 breast cancer specimens: Results from the multicenter morphometric mammary carcinoma project, Human Pathology 23(6) (1992), 603–607.
[13] V. Vapnik, Statistical learning theory, Wiley (1998).