Cross-posted from Paul Schlichtman – Arlington School Committee blog.
The new teacher evaluation system adopted in Massachusetts is an important tool in our efforts to improve the quality of teaching and learning in our public schools. We have long needed to replace the cursory system of checking a few boxes labeled “satisfactory” with a thoughtful series of rubrics based on essential components of high-quality instruction.
As an evaluator, I have worked very hard to build trust and understanding on the part of my staff. I have encouraged them to set ambitious goals for themselves and their students. I have set high expectations, including the understanding that not everyone is proficient in all aspects of the art of teaching.
Unfortunately, all of this hard work is in jeopardy, as the Supervisor of Records in the Secretary of State’s office has ruled that the Boston Public Schools must release teacher ratings aggregated by school. The Boston Globe requested the data, and intends to publish it as a measure of the quality of a school.
This ruling is laced with unintended and harmful consequences. As the principal of a small school, I know that one or two ratings of less than proficient would certainly lead to speculation as to which teacher received the low rating. It would be impossible to give a single teacher an unsatisfactory rating without that lone rating pointing to the teacher who received it. Such a disclosure would undermine the confidentiality of individual evaluations, a confidentiality that is protected under the public records law.
Publishing aggregate scores opens the possibility that evaluators will be influenced by the publication. Will evaluators inflate the results in order to give the appearance of a better school? Would there be an incentive to rate educators more harshly in order to give the appearance of holding staff to higher standards? Evaluators need to be insulated from external pressures that could undermine the validity of the evaluation.
It is particularly troubling that Boston, with a plethora of small schools, isn’t challenging the Secretary of State’s decision through the courts. By releasing the school-by-school ratings, Boston would be setting a precedent that makes it more difficult for other districts to resist.
Releasing aggregate ratings at the school level was never a goal of the teacher evaluation system. It is not an effective metric for the quality of a school, which is already reported through the state’s school accountability system. It would serve no public purpose, but it would interfere with an evaluator’s efforts to promote continuous improvement among educators.
As an educator, and not an attorney, I don’t know what we need to do to overturn this ill-conceived ruling by the Secretary of State. I don’t know if the Boston Teachers’ Union has the standing or ability to request judicial review of this decision. However, as the finding is based on an interpretation of the public records law, one solution is legislative. I intend to contact my state representative and state senator in the morning, asking for legislative action that would block the release of school-level aggregate evaluation scores, and would allow us to implement the new evaluation system in a rigorous and trusting context.
Arlington School Committee
2004 President, Massachusetts Association of School Committees
I don’t see anyone’s confidentiality being violated by the release of aggregate information, and as a fan of Nate Silver, I believe that having historical information, even in aggregate form, against which to compare future aggregate scores would be a useful tool in determining whether a school is gaining or losing ground in the aggregate.
This is part of the reason why a basic understanding of math is important. It is not just for figuring out stupid sports-related numbers; sometimes it is necessary for understanding things that are really important.
Further, the fact that this may cause some people to speculate about which teacher might be below average is a poor and pathetic argument for keeping this information hidden from people, in my opinion.
I think that individual teacher evaluations, aggregated, tell us absolutely nothing about whether the school is gaining or losing ground. Michael Jordan once scored 63 points in a game that his team lost.
And that’s why you should fear the indiscriminate release of information under a false pretense of knowledge: There’s nothing to be gained from this knowledge in this form and much to be lost by false and feeble interpretations of it…
… the ‘exemplary’ teacher who is assumed ‘needs improvement’ purely on the basis of speculation.
Individual teacher scores? That seems really intrusive to me.
Publishing individual scores would be intrusive. The implication in Mr. Schlichtman’s post is that it’s not individual scores being released.
But I don’t get what he wants. If they’re not aggregated by school … then what?
Mark L. Bail says
The problem with data, in the aggregate or not, is that people will automatically accept them as more significant than they are. In a (dubious) effort to be transparent, the publication of this material will instead cloud the issue and lead to bad policy decisions.
The common sense thought is that all data is good, but it’s not, and once a metric is established, we’re stuck with it. People will attach meaning to it and bad decisions will be made.
The other thing about teacher evaluations is that they aren’t set up just to measure, but to measure and prod. There are four rating categories, but large groups of people fall into a bell-shaped curve, which is better depicted by five categories. The same is true with the MCAS. A person can’t score a 3.5, for example. The result is a measure that is much less than precise. Yet huge policy decisions will hinge on it.
because some people, mostly Republicans and so-called independents from Waltham, are just too F’N stupid to realize that climate change is real and actually occurring around us.
Perhaps it is just me, but I choose not to base my decisions about whether a news story is worth publication on the stupidest people in the room. I, for one, understand bell curves, the difference between a mean and a median, and what it means when a poll has a certain margin of error.
I will choose enlightenment over self-imposed ignorance every time.
Mark L. Bail says
You assume the data are valid. As Kirth rightly notes, I don’t. You assume the data are enlightening. I’m disagreeing.
What are you going to learn when you find out the percentage of teachers rated Unsatisfactory, Needs Improvement, Proficient, and Exemplary? Nothing. You won’t know what they mean. You’ll get thumbnail descriptions, but you won’t know how they translate in the classroom. The haters are gonna hate, but even you won’t be able to understand the data because those categories are arbitrary. What’s the difference between Needs Improvement and Proficient? You won’t really know.
Remember when Meg Campbell, the Boston School Committee member, said she couldn’t believe how high the preliminary evaluations were? What’s it going to mean if most teachers are rated proficient? People will conclude either that something is wrong with the evaluators or that unionized teachers have pulled another fast one.
At the moment, education is being eaten alive by bogus statistics: MCAS scores, SAT scores, student growth percentiles, value-added measurements. I’d prefer not to see another one added in the name of nominal transparency and enlightenment.
BTW, was it Theresa Caputo who was mean or Alison Dubois?
“independents” from Waltham who are just too F’N stupid …
Mark L. Bail says
My problem isn’t with the recipients of the information. It’s the information that’s flawed.
“…people will automatically accept them as more significant than they are.”
This sounds like your issue is with the validity of the data, not with its being publicly aggregated, but these evaluations are the product of “a thoughtful series of rubrics based on essential components of high-quality instruction.”
I also do not necessarily accept that once a metric is established, we’re stuck with it. Established metrics get supplanted all the time by better metrics.
As for prodding and the bell curve, all employment situations include prodding, and if anyone understands the bell curve it’s educators and their administrators.
Mark L. Bail says
Educational metrics tend to stick and pile up into new incarnations.
As far as making it publicly available, to me, it’s a matter of the law.
It’s ironic to me that we score MCAS essays on the same scale as the typical employee evaluation, which is, as I understand it, the source of the MCAS rubrics. When you combine measurement with prodding, measurement suffers.
… the evaluations were designed NOT TO BE MADE PUBLIC… So the data is NOW being spindled in a manner that was never envisioned when the ‘thoughtful series…’ was first undertaken. The public aggregation of the data invalidates it. In turn, as the OP points out, this could have the effect of changing the POV of future evaluators and evaluations.
The situation is as near a complete lose-lose as I can imagine.
Glenn Koocher was a mentor who taught me to be unafraid of raising hell on the student school committee, and I also remember the Day on the Hill I attended quite vividly. It was either in 2004 or 2005, so our paths may have crossed. Keep up the good work. MASC is an organization that is quite unique to the Commonwealth, as are school committees in general, and I am glad to see you advocating on behalf of teachers.
While I do share a concern overall with publishing the scores, I agree with the poster on this aspect:
But on the other hand, Boston is not helping.
Exemplary 13.5% (520)
Proficient 79.5% (3,065)
Needs Improvement 5.8% (224)
Unsatisfactory 1.2% (48)
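(A quick sanity check of my own, not from the article: the published counts and percentages are mutually consistent, out of roughly 3,857 evaluations in total.)

```python
# Recompute the rating shares from the published counts to confirm
# the percentages above. The counts are from the post; the arithmetic
# is mine.
counts = {
    "Exemplary": 520,
    "Proficient": 3065,
    "Needs Improvement": 224,
    "Unsatisfactory": 48,
}

total = sum(counts.values())  # 3,857 evaluations in all
for rating, n in counts.items():
    print(f"{rating}: {n / total:.1%} ({n})")
```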
I find those numbers hard to believe in any job evaluation, anywhere, at any time. What can anyone learn from that kind of evaluation? I’m not talking about the numbers being public, either. Seriously, if you are working to improve schools and you are evaluating performance, does this help you?
93% are Exemplary or Proficient? There are some schools that don’t need to improve on anything; they have established perfection. Everyone, from teachers to office staff to administration, has mastered perfection. 0.0% in Needs Improvement.
Someone needs to “re-evaluate” their evaluation process in order to get something substantive.
It means adequate (not perfection), which I can believe, given that this number in its aggregate form probably reflects on the department head or principal.
Needs Improvement and, to an even greater extent, Unsatisfactory, mean that you are about to be fired. If you are rated NI, you can only go up to Proficient or down to Unsat. You can’t stay at Needs Improvement. If you are rated Unsatisfactory, you must go up to Proficient or be dismissed. If 7% of the workforce is fired each year, no one can make it through a career. Furthermore, if you think for half a second that these ratings, given by different individuals in different schools, are statistically reliable and valid, I’ve got a bridge to sell you in Brooklyn.
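The survival claim above is straightforward compound arithmetic. A sketch of it, with my own illustrative numbers (a roughly 7% annual dismissal rate and a 30-year career, neither stated precisely in the thread):

```python
# If the Needs Improvement + Unsatisfactory share (~7%) really were
# dismissed every year, the odds of lasting a full career collapse:
# each year a teacher keeps the job with probability 0.93, and the
# years compound multiplicatively.
annual_dismissal_rate = 0.07   # illustrative, from the ~7% figure above
career_years = 30              # assumed career length

survival = (1 - annual_dismissal_rate) ** career_years
print(f"Chance of surviving a {career_years}-year career: {survival:.1%}")
```

Under those assumptions only about one teacher in nine would make it to year 30, which is the commenter’s point: the low ratings cannot literally mean “about to be fired” at that volume.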
But I am wondering why no one is talking about the fact that African American teachers were rated unsatisfactory (on a track to be dismissed) at a rate FIVE TIMES higher than white teachers?
I don’t see how this evaluation helps. It is very easy to identify your bottom 5%, or top 5% for that matter. I don’t see how anything is learned to improve schools from this process. If someone presented this to me in a meeting, I would say: so what did we learn?
Plus, per eberg’s note, why is race even categorized? I’m removed from the public sector, but is this how things are normally broken down? The more I think about it, the more I really, really don’t like this being public. That is, unless things are so bad that you have to blow the lid off to improve them.
by Richard Stutman when the data first came out.
… and ‘exemplary’ and ‘proficient’ mean something other than ‘in relation to someone else…’
This critique would make sense only if teachers’ performance always mapped equally to school performance, which in turn always mapped equally to student performance. This has never been the case and never will be.
I find it perfectly acceptable for a cadre of people to be rated, grossly, in a way that fits these numbers: 93% rated proficient or exemplary is a goal of most organizations. If, as you allege, the mere shape of the numbers is sufficient proof of their inaccuracy, then you are as much as saying this goal is impossible. If that’s the case, the question arises: why evaluate at all?
Furthermore, you conflate teacher performance with school performance. Why? I would say that greater than 93% of the Boston Red Sox players would rate “exemplary” every time, never mind “proficient,” but that doesn’t mean the team wins every game: they have not, similarly, “established perfection.” That’s because the game itself is hard. In fact, continuing the analogy: what if the hurdles to overcome are every bit as forceful and impactful as the exemplary players? That is to say, exemplary players go up against exemplary players in a difficult game, but exemplary teachers aren’t going up against others who are just as good as they are; rather, they face obstacles, remorseless and uncaring, yet just as powerful as they are. The game is hard.
If the numbers are accurate (if not precise), and we have no reason to suspect otherwise, then we should congratulate our teachers for a job well done, help them to do even better and then congratulate ourselves on getting one component of the system about right… and then we ought to move on to solving real problems instead of trying to squint sideways and see a problem where none exists. Teachers are not the problem.
Please don’t take offense.
I’ve been through enough performance evaluation processes and re-engineering processes over the years. My concern is not with the teachers, it’s the evaluation.
The question is how you improve. No one is perfect; to say so in an evaluation is a problem. The goal in any organization should be to provide feedback to help employees grow. Based on the very limited information provided, but with a great deal of experience, these summary numbers would have me question the evaluation. What in your mind is actionable? How does this help people grow? How does this improve the school?
Mark L. Bail says
The old system was a process everyone involved thought was stupid. The new evaluation was the result of work by teachers unions and other stakeholders. The evaluation itself is designed to provide decent feedback to teachers and to screen for problems. The emphasis is on improvement, not on termination.
Unfortunately, there is a neurotic compulsion in today’s society to quantify everything and then want it to go up. We do this at the school level by testing kids, aggregating the scores, and then pretending they tell us something that we don’t already know.
Bear in mind, the statistics used are always simple and unrevealing. Measures of variation play no role: no standard deviations, no ANOVA, nothing that would help us understand what these statistics actually mean. No one knows if tests like the MCAS actually measure anything.
The goal is to find a system that declares unionized public school teachers an obstacle to “excellence,” so that the only remaining step is to privatize the schools and break the unions. That way the billionaires who fund the groups pushing this can make money off new private and charter schools by vacuuming up tax revenue and underpaying a revolving staff.
Oh, were you looking for the rationale?
mark-bail had this to say
The more time I have to think about it, the more I’m not in favor of this being public information. It is, at its core, private. I don’t know how I would have a constructive dialogue with anyone with this data hanging public over our heads. It goes back to my original comment quoting the poster, so I’ll quote him again:
Also, on the trust factor: based on your post, there is very little of it.
With the scores out a whole 20 hours, I’m already hearing statements in town about X school being the best because it had the most exemplary teachers. Those of us working on the system see that school as problematic because it had an incorrect working definition of the term “exemplary.” Thus, when that number goes down by half next year, we can expect to hear about the “crisis” that must be occurring in that building.
… ‘fantasy football’, but in reverse: individual effort in a team setting is aggregated to provide a wholly unreal picture of performance, and then, pitted against an entirely different, but also wholly unreal, opponent. Oh, and nobody wins.
From a legal point of view, there is no question that once the reports exist the reports are public records, and cannot be withheld. If the reports named individuals, then it is likely that they would not be public records.
I have trouble with the argument that if the methodology is imperfect that the reports should be kept secret. The better remedy is a full explanation of the methodology and its limitations so that the limits of the evaluations can be understood. This also would create pressure to come up with better methods of evaluation. The problem of imperfect measurement comes up in all fields, not just education.
Mark L. Bail says
I wish it were that simple.
I agree with you that it is a legal question, and therefore, a moot point.
Garbage in, garbage out. This is not, as mooted elsewhere, equivalent to publishing climate data. This is equivalent to publishing the data from the first trial by a newly hired TA who doesn’t know what he’s doing in a new lab.
This system has been jury-rigged from the start, with the state Department of Education delivering components several months late (it just delayed the implementation of test scores in this system by a year, then it canceled the delay, then it confirmed it, then it considered delaying even more, then it changed its mind). The mode at DoE on this is to do something, then think about how to do it. In this way, these numbers are suspect. Not because the numbers are bad, but because they are established in a suspect way. There has been a lot of confusion coming from DoE about what constitutes exemplary and what doesn’t. “Needs improvement” doesn’t mean “imperfect”…except some people say it does. “Most of the time” means 90%, 50.1%, or something in between…depending on who you ask.
I see this data as a window on how this system is being implemented. That it actually says anything about the quality of the BPS workforce is doubtful, to say the least.