Statistical analysis of test item data helps faculty build valid, reliable assessment instruments that accurately measure student performance. The data elicited from this type of analysis contains information essential for identifying the strengths and weaknesses of the instrument.

The figure below shows a sample of the data produced by a SoftScore test analysis:

[Figure: sample SoftScore test-analysis output]

The mean, median, and mode scores, as well as the standard deviation, are compiled for the participants in the group. These statistics give the instructor comparative results for the group as a whole.
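The descriptive statistics above can be sketched in a few lines of Python using the standard library. The score list here is hypothetical, purely for illustration; this is not SoftScore's implementation.

```python
# Descriptive statistics for a group of test scores (hypothetical data).
from statistics import mean, median, mode, stdev

scores = [72, 85, 85, 90, 78, 64, 85, 91, 70, 88]  # illustrative raw scores

print("Mean:", round(mean(scores), 2))      # central tendency (average)
print("Median:", median(scores))            # middle score when sorted
print("Mode:", mode(scores))                # most frequent score
print("Std dev:", round(stdev(scores), 2))  # sample standard deviation
```

The standard deviation indicates how tightly the class's scores cluster around the mean; a large value signals wide performance spread.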

A reliability coefficient (KR20) is a correlation coefficient between two sets of scores. These two sets could be a baseline and a summative collection, two baseline or two summative collections, etc. A coefficient of 0 indicates no relationship between the two sets of scores; a coefficient of 1 indicates the same score on both administrations. The approximate ranges of reliability coefficients can be described as follows:

KR20 - Description

  • .90 and above – High reliability. Suitable for making a decision about an examinee based on a single test score.
  • .80 to .89 – Good reliability. Suitable for evaluating individual examinees if some test items are improved.
  • .70 to .79 – Moderate reliability. Suitable for evaluating individual examinees if the majority of test items are improved.
  • .60 to .69 – Low reliability. Suitable for evaluating individuals only if averaged with several other scores of similar reliability.
  • .50 to .59 – Doubtful reliability. Should be used with caution to evaluate individual examinees; may be satisfactory for determining average score differences between groups.
  • .49 and below – Questionable reliability. Should be used with caution to evaluate individual examinees; may be satisfactory for determining average score differences between groups.
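For dichotomous (right/wrong) items, the standard KR-20 formula is k/(k-1) × (1 − Σpq/σ²), where k is the number of items, p and q are the proportions answering each item correctly and incorrectly, and σ² is the variance of the total scores. The sketch below, with a hypothetical response matrix, illustrates that computation; it is not SoftScore's actual code.

```python
# Minimal KR-20 computation for a 0/1 item-response matrix
# (rows = examinees, columns = items). Data are hypothetical.

def kr20(responses):
    """Kuder-Richardson Formula 20 for dichotomous items."""
    n = len(responses)          # number of examinees
    k = len(responses[0])       # number of items
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    # Population variance of the total scores.
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    # Sum of item variances: p * (1 - p) for each item.
    pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / var_total)

# Hypothetical data: 5 examinees, 4 items (1 = correct, 0 = incorrect).
data = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(data), 3))
```

A value near the top of the table above (.90+) would support decisions about individual examinees; the illustrative data here falls in the "good reliability" band.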

Another piece of information provided is the individual response statistics for item quality. The Item# is the question number within the test. The Percent Selected denotes the percentage of students who selected a given answer. The following is provided for each test item as well as for individual distractors.

  • Upper 27 percent of Group: the percentage of students in the upper 27 percent of the class who responded correctly.
  • Lower 27 percent of Group: the percentage of students in the lower 27 percent of the class who responded correctly.

The Discrimination Index measures how effectively a question discriminates between students who have mastered the material and those who have not, rating question effectiveness as low, medium, or high. ExamSoft’s ScorePak® classifies item discrimination as:

  • 0.09 and below: Unacceptable.
  • 0.10 – 0.29: Fair item.
  • 0.30 – 0.39: Good item.
  • 0.40 and above: Excellent item.
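A common way to compute the discrimination index from the upper and lower 27 percent groups described above is the difference in their proportions correct. The sketch below, with hypothetical class data, illustrates this; the exact grouping rule SoftScore uses may differ.

```python
# Upper/lower 27% discrimination index (illustrative, not SoftScore's code).

def discrimination_index(total_scores, item_correct, fraction=0.27):
    """D = p(upper group correct) - p(lower group correct).

    total_scores: each examinee's total test score
    item_correct: 1/0 flags for one item, same order as total_scores
    """
    n = len(total_scores)
    g = max(1, round(n * fraction))                    # size of each group
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:g], order[-g:]               # bottom and top 27%
    p_upper = sum(item_correct[i] for i in upper) / g
    p_lower = sum(item_correct[i] for i in lower) / g
    return p_upper - p_lower

# Hypothetical class of 10: total scores and 1/0 results on one item.
totals = [95, 90, 88, 80, 75, 70, 65, 60, 55, 40]
item7  = [1,  1,  1,  1,  0,  1,  0,  1,  0,  0]
print(round(discrimination_index(totals, item7), 2))
```

A value of 0.40 or above, as in this example, would place the item in the "excellent" band of the classification above.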

Point Biserial (Pt-Biser.) coefficients are also listed. The Point-Biserial coefficient is the correlation between the score on an item and the total score on the test. In essence, it details how well an item predicts student performance on the entire exam by comparing how students did on one question relative to how they did on all the questions. The scores range from -1 to +1. The scale below reflects the ranges of Pt-Biser scores:

  • < 0.09 or negative: Poor test distractor.
  • 0.09 - 0.19: Fair test distractor.
  • 0.20 – 0.29: Good test distractor.
  • > 0.30: Excellent test distractor.
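The point-biserial can be computed as the Pearson correlation between a 0/1 item column and the total scores; one standard closed form is (M₁ − M)/σ × √(p/(1−p)), where M₁ is the mean total score of students who answered correctly, M and σ are the mean and population standard deviation of all total scores, and p is the proportion answering correctly. The data below are hypothetical; this is an illustrative sketch, not SoftScore's implementation.

```python
# Point-biserial correlation between a 0/1 item and total scores
# (hypothetical data, illustrative only).
import math

def point_biserial(item, totals):
    n = len(totals)
    mean_all = sum(totals) / n
    sd = math.sqrt(sum((t - mean_all) ** 2 for t in totals) / n)  # population SD
    ones = [t for t, x in zip(totals, item) if x == 1]
    p = len(ones) / n                      # proportion answering correctly
    mean_1 = sum(ones) / len(ones)         # mean total score of that group
    # r_pb = (M1 - M) / SD * sqrt(p / (1 - p))
    return (mean_1 - mean_all) / sd * math.sqrt(p / (1 - p))

totals = [95, 90, 88, 80, 75, 70, 65, 60, 55, 40]
item   = [1,  1,  1,  1,  0,  1,  0,  1,  0,  0]
print(round(point_biserial(item, totals), 2))
```

A strongly positive value, as here, indicates that students who answered the item correctly also tended to score well overall.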

If the Pt-Biser is a low positive or a negative value, it can be used to identify problematic areas such as:

  • A questionable correct answer.
  • More than one correct answer.
  • No real correct answer.
  • An ambiguous or confusing question stem.