Now that states and the federal government are attaching high stakes to standardized tests, these tests are coming under increasing scrutiny. They don't appear to be holding up well to this additional scrutiny.

A top New York state education official acknowledged Wednesday that the mounting number of errors found on this year's math and English tests has eroded public trust in the statewide exams.

"The mistakes that have been revealed are really disturbing," New York State Board of Regents Chancellor Merryl Tisch said at a Midtown breakfast sponsored by Crain's New York Business.

"What happens here as a result of these mistakes is that it makes the public at large question the efficacy of the state testing system," said Ms. Tisch, whose board sets education policy for the state.

Still, Ms. Tisch said testing experts have told state officials that the exams are valid and can be used to evaluate students and, in some cases, teachers.

Over the past several weeks, a series of errors by test-maker Pearson PLC have come to light, ranging from typographical mistakes to a now-infamous nonsensical reading passage about a pineapple. This is the first year of a five-year, $32 million contract the state awarded to Pearson, which also publishes textbooks.

To date, 29 questions have been invalidated on various third- through eighth-grade math and English tests, which are used in New York City to determine whether students are promoted to the next grade.

Pearson didn't return a request for comment.

Mistake-riddled tests are not the only problem being highlighted.

Is it okay to ask a child to reveal a secret? Richard Goldberg doesn’t think so. Goldberg, the father of 8-year-old twin boys, was dismayed to learn his third-grade sons were asked to write an essay about a secret they had and why it was hard to keep. The unusual question, which Goldberg called "entirely inappropriate," was on the standardized tests given to public school students in the third through eighth grade every spring.
[...]
The question will not, however, appear on any future versions of the test, Barra said. "We’ve looked at this question in light of concerns raised by parents, and it is clear that this is not an appropriate question for a state test," Barra said.

Increasingly, calls are being made to make these tests public, so they can be fully vetted.

I learned that the tests themselves are being kept secret because the state Department of Education and Pearson, their test development contractor, wrote strong confidentiality provisions into the contract. My understanding is that this was so that they both could reuse test questions in the future. In order for the questions to be reusable, they have to be kept secret, otherwise students could prep too easily for the tests, and Pearson’s other customers would be able to get the tests from the public domain.

We only know about the gaffes because students exposed them. Educators have been sworn to secrecy. The Education Department has emphasized their concerns about test prep, but to me the secrecy seems rooted in economics: Secrecy saves New York on future test development costs and makes it easier for Pearson to re-sell the questions it created for New York (at New York taxpayers’ expense) in other states.

Two things strike me as odd about this. First, it’s uncommon to keep tests completely secret after the fact of their administration. Letting people see the test is a basic part of education.

The purpose of testing is to measure how well a student knows subject matter and to identify what areas need work. If the only thing one knows about a child’s performance on a test is his grade, and one can’t review the actual test, the test is pedagogically useless and can only serve a punitive purpose.

If the broader community of parents, educators and researchers can’t see tests, then we have no way of judging the connection between them and curricula or how to help our children.

A paper by the National Board on Educational Testing and Public Policy titled "Errors in Standardized Tests: A Systemic Problem" found:

This paper contains a sizable collection of testing errors made in the last twenty-five years. It thus offers testimony to counter the implausible demands of educational policy makers for a single, error-free, accurate, and valid test used with large groups of children for purposes of sorting, selection, and trend-tracking.

No company can offer flawless products. Even highly reputable testing contractors that offer customers high-quality products and services produce tests that are susceptible to error. But while a patient dissatisfied with a diagnosis or treatment may seek a second or third opinion, for a child in a New York City school (and in dozens of other states and hundreds of other cities and towns), there is only one opinion that counts – a single test score. If that is in error, a long time may elapse before the mistake is brought to light – if it ever is.

This paper has shown that human error can be, and often is, present in all phases of the testing process. Error can creep into the development of items. It can be made in the setting of a passing score. It can occur in the establishment of norming groups, and it is sometimes found in the scoring of questions.
[…]
Measuring trends in achievement is an area of assessment that is laden with complications. The documented struggles experienced by the National Center for Education Statistics (NCES) and Harcourt Educational Measurement testify to the complexity inherent in measuring changes in achievement. Perhaps such measurement requires an assessment program that does only that. The National Center of Educational Statistics carefully tries to avoid even small changes in the NAEP tests, and examines the impact of each change on the test’s accuracy. Many state DOEs, however, unlike NCES, are measuring both individual student achievement and aggregate changes in achievement scores with the same test – a test that oftentimes contains very different questions from administration to administration. This practice counters the hard-learned lesson offered by Beaton, “If you want to measure change, do not change the measure” (Beaton et al., 1990, p. 165).

Furthermore, while it is a generally held opinion that consumers should adhere to the advice of the product developers (as is done when installing an infant car seat or when taking medication), the advice of test developers and contractors often goes unheeded in the realm of high-stakes decision-making. The presidents of two major test developers – Harcourt Brace and CTB McGraw Hill – were on record that their tests should not be used as the sole criterion for making high-stakes educational decisions (Myers, 2001; Mathews, 2000a). Yet more than half of the state DOEs are using test results as the basis for important decisions that, perhaps, these tests were not designed to support.

Finally, all of these concerns should be viewed in the context of the testing industry today. Lines (2000) observed that errors are more likely in testing programs with greater degrees of centralization and commercialization, where increased profits can only be realized by increasing market share, “The few producers cannot compete on price, because any price fall will be instantly matched by others .... What competition there is comes through marketing” (p. 1). In Minnesota, Judge Oleisky (Kurvers et al. v. NCS, Inc., 2002) observed that Basic Skills Test errors were caused by NCS’ drive to cut costs and raise profits by delivering substandard service – demonstrating that profits may be increased through methods other than marketing.

When it comes to standardized tests, profit appears to be winning out over quality.

Here's the full paper.

Errors in Standardized Tests: A Systemic Problem