Feds question high scoring principal evaluations

(Pa.) Just five percent of school leaders evaluated under a new performance tool in Pennsylvania were found to need improvement, and none received a failing grade.

Although a federal review team found that the new evaluation system includes a proper array of indicators for distinguishing leadership qualities, the fact that so many principals and assistant principals received high scores raised a red flag for auditors.

“The concentration of scores in the top two performance levels contrasts with prior research that has revealed clear differences in the contributions principals make to student achievement growth,” researchers from the U.S. Department of Education’s Institute of Education Sciences said in a report out this month. “Supervisors may thus have rated their school leaders too positively.”

The report is one of the first to take a close look at the evaluation tools being implemented across the country as part of the No Child Left Behind waivers issued by the Education Department over the past two years.

The Obama administration has granted relief to 42 states and the District of Columbia from the most pressing sanctions and requirements of NCLB. One of the key conditions of the waiver, however, is that states as well as local educational agencies covered by the agreement develop and implement new evaluation systems for school leaders that take into account student achievement growth and the quality of principals’ leadership practices.

According to IES, however, there is little evidence documenting the reliability or validity of these new tools. Indeed, one recent study found that 63 of 65 evaluation tools in use in the U.S. had no evidence showing whether they accurately measured performance.

The Pennsylvania report, written by a team from Mathematica Policy Research, a think tank based in Princeton, N.J., noted that legislation approved in 2012 created the Framework for Leadership, which rates school leaders on 19 practices as well as academic growth.

The practices are organized into four categories: strategic/cultural leadership, systems leadership, leadership for learning, and professional and community leadership. The state piloted the program over the past two years and plans to roll it out statewide this year.

Under the system, half of a principal’s or assistant principal’s score is based on a supervisor’s assessment of leadership skills. The other half is based on measures of student achievement.

The IES report analyzed the system using three questions:

• Internal consistency — Did different parts of the evaluation tool come to similar conclusions about a school leader’s effectiveness? “This is desirable because the leadership qualities captured by different parts of the FFL (Framework for Leadership) are supposed to reflect an overall capability to improve student achievement through effective school leadership,” the report said.

• Score variation — How much do scores differ across school leaders? “Score variation is necessary for the FFL to differentiate between high- and low-performing school leaders, a basic goal for any evaluation tool,” researchers said.

• Concurrent validity — How much do scores in a given year correlate with school leaders’ contributions to student achievement growth in the same year? This, they said, would indicate that the evaluation scores reflect leadership practices that contribute to raising student achievement.
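The last two checks are simple statistics: score variation is the spread of ratings across leaders, and concurrent validity is the correlation between ratings and estimated growth contributions. As a rough illustration only, here is a minimal Python sketch of those two calculations using made-up numbers; nothing below is data from the IES report.

```python
# Hypothetical sketch of the score-variation and concurrent-validity checks.
# All numbers are invented for illustration; they are not from the report.
from statistics import mean, pstdev

ffl_scores = [2.8, 3.0, 2.9, 3.1, 2.7, 3.0]          # hypothetical supervisor ratings
growth_est = [0.10, -0.05, 0.20, 0.00, 0.15, -0.10]  # hypothetical growth estimates

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

# Score variation: a tight spread makes it hard to distinguish
# high- from low-performing school leaders.
print(f"score spread (std dev): {pstdev(ffl_scores):.2f}")

# Concurrent validity: a near-zero correlation would mirror the report's
# finding that scores did not track estimated growth contributions.
print(f"score-growth correlation: {pearson(ffl_scores, growth_est):.2f}")
```

Internal consistency is typically assessed with related correlational measures (for example, correlations among the four category scores), which the report's researchers computed across the full pilot sample.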

Key findings:

This interim report provides findings and considerations based on the pilot evaluation data from 2012/13 for 336 principals and 69 assistant principals in Pennsylvania:

• The full evaluation tool had good internal consistency for both principals and assistant principals. School leaders who earned higher scores in one category of leadership practices tended to earn higher scores in the other categories.

• Most school leaders received scores of proficient or distinguished for specific leadership practices. Supervisors rated the performance of both principals and assistant principals as proficient or distinguished 95 percent of the time and as needing improvement in the remaining five percent. The most common rating was proficient (70 percent for principals and 79 percent for assistant principals). Supervisors rarely assigned a failing rating: only two principals received a failing rating on a component, while no assistant principal received a failing rating.

• School leaders with larger estimated contributions to student achievement growth did not, on average, receive higher scores than school leaders with smaller estimated contributions to student achievement growth.

Interim conclusions and suggestions:

• The findings from the 2012/13 pilot reveal both strengths and weaknesses of the evaluation tool. The good internal consistency of the full FFL suggests that it is based on a coherent definition of leadership quality. However, the concentration of scores in the top two performance levels contrasts with prior research that has revealed clear differences in the contributions principals make to student achievement growth. Supervisors may thus have rated their school leaders too positively. This possibility is substantiated by the absence of a positive correlation between school leaders’ FFL scores and their contributions to student achievement growth.

• This lack of correlation between scores and school leaders’ contributions to student achievement growth does not necessarily make FFL scores a less valid measure of school leaders’ effectiveness than scores from other tools. To date, there is no robust evidence that any current school leader evaluation tool is associated with school leaders’ contributions to student achievement growth.

• Nevertheless, these findings suggest that more evidence is needed on the validity of using scores to identify effective and ineffective school leaders. The Pennsylvania Department of Education may need to consider additional measures of school leaders’ performance, such as anonymous ratings by teachers, which may be less susceptible to excessive leniency. Even if the additional measures do not factor officially into evaluations, they can be compared with FFL scores as a check on whether supervisors are being too lenient in assigning ratings in the FFL.