It is to the credit of the Indian Institutes of Management (IIMs) that they have shared information about the scientific process of equating their various test forms.
Test forms (or sets of questions) vary from day to day and from candidate to candidate, even on a single day. The examination body must therefore ensure either that every test form is of the same difficulty level (which is technically impossible) or that the scores candidates receive on different forms are equated.
In their latest disclosure, the IIMs and Educational Testing Service (ETS) have described the science of assessment, which is well known and widely used in the design of assessments. Though the information is a step in the right direction, it is of limited value without disclosure of hard metrics such as reliability values and standard error. These figures, in layman's terms, describe how accurately and consistently the test measures the true ability of a candidate.
In the West, it is mandatory for institutions to publish their test construction and scoring methodology, along with evidence that they follow fair and unbiased testing practices. If the IIMs have used classical test theory for CAT, they should publish the reliability of the test. Reliability can be measured with Cronbach's alpha or similar metrics. Their website claims that they have also used item response theory. In that case, the standard error is one of the key metrics to report.
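To make the reliability figure concrete, here is a minimal sketch of how Cronbach's alpha could be computed from item-level responses. The response matrix is invented purely for illustration and has nothing to do with CAT data; the function name `cronbach_alpha` is our own.

```python
from statistics import pvariance

def cronbach_alpha(scores):
    """Cronbach's alpha for a score matrix.

    scores: list of rows, one row per candidate, one column per item.
    """
    k = len(scores[0])
    if k < 2:
        raise ValueError("need at least two items")
    # Variance of each item across candidates.
    item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
    # Variance of the candidates' total scores.
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical responses: 5 candidates x 4 items, 1 = correct, 0 = wrong.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 3))  # 0.8
```

A value close to 1 means the items rank candidates consistently; a real test would have far more items and candidates than this toy matrix.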
For example, GMAT publishes its standard error: “The standard error of difference for the total GMAT score is about 41, so chances are about two out of three that the difference between the total GMAT scores received by two test takers is within 41 points above or below the difference between the test takers' true scores. The standard error of difference for the Verbal score is 3.9, and for the Quantitative score, it is 4.3.” Similarly, GRE reports the reliability coefficient and standard error of its Verbal section as 0.91 and 34 respectively. For the Quantitative section of GRE, the figures are 0.89 and 51 respectively. If the standard error is too high, we can say only with little confidence that a candidate with a higher test score actually has a higher ability.
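Reliability and standard error are two views of the same thing. Under classical test theory, the standard error of measurement equals the score standard deviation times the square root of one minus the reliability, and the standard error of the difference between two candidates is the square root of two times that, assuming both scores carry equal and independent error. The sketch below applies these textbook relations to the GRE figures quoted above; the score spreads it prints are implied by those relations, not published numbers, and the helper names are our own.

```python
import math

def sem_from_reliability(sd, reliability):
    """Standard error of measurement under classical test theory."""
    return sd * math.sqrt(1.0 - reliability)

def implied_sd(sem, reliability):
    """Score standard deviation implied by a reported SEM and reliability."""
    return sem / math.sqrt(1.0 - reliability)

def standard_error_of_difference(sem):
    """SED between two candidates, assuming equal, independent errors."""
    return math.sqrt(2.0) * sem

# Figures quoted in the text: GRE Verbal (0.91, 34), Quantitative (0.89, 51).
gre_verbal_sd = implied_sd(34, 0.91)
gre_quant_sd = implied_sd(51, 0.89)
print(round(gre_verbal_sd), round(gre_quant_sd))  # 113 154
```

The same relation shows why a small drop in reliability matters: with the score spread held fixed, going from 0.91 to 0.89 raises the standard error by roughly a tenth.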
Only these statistical measurements can tell us whether the test is really working and how accurate it is. For example, how much the ability of a 96th-percentile candidate differs from that of a 93rd-percentile candidate can be determined only from these parameters.
If the standard error is high, we cannot say with confidence that the candidate at the 96th percentile is better than the candidate at the 93rd. The standard error is very important for institutions that use CAT to select candidates, as it helps them use the score effectively.
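As a rough illustration of how an institution could use a published standard error, the sketch below turns an observed score gap into an approximate confidence that the higher scorer really has the higher true score. It assumes normally distributed measurement error and no prior information about the candidates, which is a simplification. Since no CAT figure is available, it uses the GMAT standard error of difference of 41 quoted earlier; the function names are our own.

```python
from statistics import NormalDist

def confidence_higher(observed_gap, sed):
    """Approximate probability that the higher scorer has the higher true
    score, assuming normal errors and no prior information."""
    return NormalDist().cdf(observed_gap / sed)

def min_gap_for_confidence(sed, confidence=0.95):
    """Smallest observed gap that reaches the given one-sided confidence."""
    return NormalDist().inv_cdf(confidence) * sed

GMAT_TOTAL_SED = 41  # standard error of difference quoted in the text

# A 30-point gap on the GMAT total score gives only modest confidence.
print(round(confidence_higher(30, GMAT_TOTAL_SED), 2))   # 0.77
# Gap needed before we are about 95% confident of the ordering.
print(round(min_gap_for_confidence(GMAT_TOTAL_SED)))     # 67
```

Without the equivalent figure for CAT, the same calculation cannot be done for a 96th versus a 93rd percentile candidate.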
The second metric of assessment quality is validity. It tells us whether the parameters the test measures are even justified for the purpose of admission. So far, institutions have had little to say about it. In industry, however, some leading companies carry out such validity exercises on a regular basis.
A science is only as good as its implementation. We look forward to the technical results to understand the quality of the implementation.