Tuesday, May 3, 2011

My Latest Problem

In every job, problems arise and, unfortunately, there are not always great solutions.

I ran into a problem on Saturday and I wasn’t sure what to do with it. When you are dealing with a group of college students and the decisions you make are important to those people, inconsistencies can drive you crazy.

I have two Intermediate Accounting II classes this semester. At many schools, Intermediate II is considered one of the most challenging courses in the entire university. Consequently, the grade in that class is often viewed as extremely important to the students. The difference of one letter grade can have major implications in the direction of a career.

Our final exam schedule this semester ran from Monday morning at 9 a.m. until Saturday evening. As luck would have it, one of my Intermediate II classes had its final exam in the very first slot from 9 until noon on Monday. The other class had its exam on Friday evening from 7 until 10 p.m. Because my classes were small this semester, I allowed my Intermediate II students to take either exam. I didn’t care. Eight students chose to take the first exam and 25 chose to take the second. Most students seemed uncomfortable trying to take the exam on Monday without sufficient time to prepare. That was fine by me.

I gave the first group 37 problems, each expected to take from 3 to 8 minutes. No one left early but everyone seemed to be finished or close to finished by the end. I graded that exam and came up with a raw score for each student. I did not curve the exam at that time because I only had 8 tests and wanted to see how the other 25 students did. For convenience, let’s assume that the raw scores ranged from minus 20 to minus 60. I liked the test; I liked the range of raw scores; I liked the length of time that my students took to finish the exam; I liked the distribution from top to bottom.

I would have loved to give the same test to my second class but I worried (especially over a 5-day period) that too much information would get out. I trust my students but I don’t want to put too much temptation out there for them. My guess is that every university worries about cheating.

So, I took each of those 37 problems and changed it slightly – I increased a 6 percent interest rate to an 11 percent rate, I changed a residual value from $10,000 to $25,000, I changed a $5,000 gain to a $9,000 loss. I then rearranged the questions into a different order just in case some student had slipped information out such as “the first question requires you to deal with a 20 percent stock option.”

I honestly believed that I was giving the second group a test that was the equivalent of the first test.

However, the results were so much worse that I was stunned. For the second group, the raw scores ranged from roughly minus 20 (same) to minus 90. Worse still, and this is what really caught my attention, approximately half of the students in the second group did worse than my very worst student in the first group. It was as if the two groups had taken two completely different exams.

My problem became immediately obvious to me. Should the students in the first group get significantly higher grades than the second group or was there something about the second test that made it harder (and that I was not seeing)? I thought I had given comparable tests but maybe not. For example, maybe some of my changes managed to create more complex situations. Or, perhaps changing the order of the questions caused a problem for the students (maybe the first questions were now harder and slowed them down or discouraged them).

And, to make matters even worse, although virtually all of the first group finished the test on time, many of the students in the second group did not come even close to completing it. I had page after page of blanks. It is hard to give any partial credit to a blank sheet of paper.

--Could the better students have all taken the first test rather than the second? If so, then I had a justification for giving them a higher grade. But, for the most part, I couldn’t see any difference in the abilities of the two groups (and I looked very carefully).

--Could the first class have been bright and awake at 9 a.m. and the second class sleepy at 7 p.m.? They didn’t look any different but I could not peek inside of their heads. And, should that make a difference in the final grading?

--Could the first class have been fresh because it was their first final exam and the second class exhausted because it was their fourth or fifth? And, again, if so, should that have any effect on the grading? Should you factor in the time of the test when handing out grades? If two students both make minus 60 can you give one student a different grade because he or she took the test on Friday night after 3-4 other tests over a long week?

--Could some of the changes I made in the questions have subtly changed easy questions (on the first test) into hard questions (on the second test) without my awareness?

Why? Why? Why? That is the question I have asked over and over since then. Why were the raw scores so different and what should I do about that in arriving at final test scores?

It is several days later and I still do not have a good answer for that question.


  1. I think selection bias is the best explanation, but perhaps not the only one. Students confident in their knowledge would rather take the test right away and get it out of the way than wait until later.

    Alternatively, the kids in the second group could simply have been burnt out from studying for other finals and not had the time they needed to concentrate on your exam.

    I would use as the crucial reference how each student's score changed relative to the mid-term, not just the raw score. A student who performed poorly on the mid-term and then poorly on the final is not a surprise, but a student who performed well on the mid-term and poorly on the final (and took the second exam) is.

  2. I had a similar issue with a Cost Accounting midterm. I have two sections: Section 1 meets twice a week and Section 2 meets once a week. I gave the once-a-week section one 2.5-hour midterm. The other section got part 1 on Tuesday and part 2 on Thursday. I felt that the one-big-test group had a much more difficult task for several reasons, not the least of which was that the other section, after seeing which problems were on part 1, could then concentrate for part 2 on the topics that were obviously missing.

    In the end, I scaled the 2nd section in 4 places to accommodate the difference. The good news was that, in both sections, the right people got the right grades in relative terms. Scaling was only needed to compare the two sections.
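
The two suggestions in the comments (judge each student against his or her own mid-term, then scale one group so the two can be compared) can be tried out with a few lines of Python. This is only a rough sketch: the records are invented for illustration and are not the real grades, the function names are hypothetical, and the constant shift is just one simple way to scale, not necessarily what either commenter did. It also assumes that the two groups would have dropped by the same amount from mid-term to final on an equivalent test, which is exactly the assumption that selection bias would undermine.

```python
from statistics import mean

# Hypothetical records: (sitting, midterm score, final exam score).
# These numbers are invented for illustration; they are not the real grades.
records = [
    ("monday", 80, 78),
    ("monday", 70, 66),
    ("friday", 82, 60),
    ("friday", 68, 50),
]

def mean_change_by_sitting(rows):
    """Average (final - midterm) for each exam sitting."""
    changes = {}
    for sitting, midterm, final in rows:
        changes.setdefault(sitting, []).append(final - midterm)
    return {sitting: mean(vals) for sitting, vals in changes.items()}

def shift_to_match(rows, target, reference):
    """Add a constant to the target sitting's finals so that its average
    change from the midterm equals the reference sitting's average change."""
    change = mean_change_by_sitting(rows)
    offset = change[reference] - change[target]
    return [
        (sitting, midterm, final + offset if sitting == target else final)
        for sitting, midterm, final in rows
    ]

changes = mean_change_by_sitting(records)
adjusted = shift_to_match(records, target="friday", reference="monday")
print(changes)
print(adjusted)
```

A large gap between the two average changes points toward the test or the time slot rather than the students. The shift keeps the ranking within each group intact, so it only affects how the two groups compare with each other.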