In the period 2007-2008, almost 70% of cases with the US Office of Research Integrity involved allegations of image manipulation. Are data fabricators and falsifiers mainly showing off their Photoshop skills rather than fiddling other types of data? I doubt it. I suspect the prevalence of image issues reflects the availability of tools to detect flaws in images, which some journals use to check every accepted manuscript. About 1% of manuscripts are rejected in this final check.
Would it be possible to create equivalent tools and procedures for detecting anomalies in non-image data? It is going to be difficult given the immense diversity of data types and data properties. Worse, whereas suspect images can often be compared against the originals, making manipulation obvious (the intent of manipulation may be harder to gauge), this may not be possible for data (how do you tell an Excel file is the original?).
One well-known tool for identifying falsified data is Benford’s Law, which concerns the frequency of the digits 0-9 in the first and subsequent positions in a number. Sometimes it is useful for detecting malpractice, but it would be a disaster if all data sets were tested against Benford’s law. Many data sets would fail the test because Benford’s law is not an appropriate null expectation for them. No-one expects, for example, the first digit of the weight of adult humans to follow Benford’s law. The last digit of tree diameters should follow Benford’s law, but because the measurements are read from a tape, it is probable that discrepancies will creep in due to difficulties in precisely estimating this least important digit (one dataset I checked had an understandable excess of zeros and fives). Determining which data sets should be expected to conform to Benford’s law would not be trivial. Even for data sets which are expected to follow Benford’s law, there would be a large number of false positives unless the p-value used to identify anomalies was set to be very low, in which case the analysis would lack power.
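As a rough illustration of the first-digit version of the test, the sketch below computes the expected Benford frequencies, log10(1 + 1/d) for d = 1 to 9, and a chi-square statistic comparing them with observed first digits. The function names and the two example data sets are hypothetical; 15.51 is the conventional 5% critical value for chi-square with 8 degrees of freedom.

```python
import math
from collections import Counter

def benford_first_digit_probs():
    """Expected first-digit frequencies under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(x):
    """Leading non-zero digit of a non-zero number, read from scientific notation."""
    return int(f"{abs(x):.15e}"[0])

def benford_chi_square(values):
    """Chi-square statistic (8 d.o.f.) of observed first digits against Benford's law."""
    digits = [first_digit(v) for v in values if v != 0]
    n = len(digits)
    counts = Counter(digits)
    return sum(
        (counts.get(d, 0) - n * p) ** 2 / (n * p)
        for d, p in benford_first_digit_probs().items()
    )

# Hypothetical examples: powers of 2 span many orders of magnitude and sit
# close to Benford's law; weights between 50 and 99 kg cannot.
powers_of_two = [2.0 ** k for k in range(1000)]
adult_weights_kg = list(range(50, 100))

CRITICAL_5_PERCENT_8_DOF = 15.51
print(benford_chi_square(powers_of_two) > CRITICAL_5_PERCENT_8_DOF)
print(benford_chi_square(adult_weights_kg) > CRITICAL_5_PERCENT_8_DOF)
```

The weights example fails the test for an entirely innocent reason: values confined to a narrow range simply cannot have Benford-distributed first digits.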
The same problems would occur with any mass testing. A more focused approach is needed.
A recent paper by Joel Pitt and Helene Hill with the understated title “Statistical analysis of numerical preclinical radiobiological data”, explores some methods that are useful for testing the credibility of data with three replicates. They find that one of the scientists they investigate reported data where one of the replicates matched (within rounding) the mean of all three replicates far more often than either expected under a null model or found by other scientists working with equivalent data (the scientist in question seems to have picked the mean they wanted and assigned this to the first value and then picked high and low values roughly equidistant from the first). The final digit of this scientist’s data also fails to conform to the uniform distribution that is expected, and that is found in the other scientists’ data. The comparison of patterns in the suspect data with both other data and null models is a powerful approach.
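The two checks described above can be sketched roughly as follows. This is not Pitt and Hill’s actual procedure, only a minimal illustration assuming integer counts: it flags triples that contain their own rounded mean, compares the number of such triples with a reference rate through a binomial tail probability, and computes a chi-square statistic for the final digits against a uniform distribution. All names, the example triples and the reference rate of 0.1 are hypothetical.

```python
import math
from collections import Counter

def contains_rounded_mean(triple):
    """True if one of three integer counts equals their mean rounded to an integer."""
    return round(sum(triple) / 3) in triple

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p) with 0 < p < 1, summed in log space."""
    total = 0.0
    for i in range(k, n + 1):
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log(1 - p)
        )
        total += math.exp(log_term)
    return min(total, 1.0)

def terminal_digit_chi_square(counts):
    """Chi-square statistic (9 d.o.f.) of the last digits of integer counts vs uniform."""
    digits = Counter(abs(c) % 10 for c in counts)
    expected = len(counts) / 10
    return sum((digits.get(d, 0) - expected) ** 2 / expected for d in range(10))

# Hypothetical triples of replicate counts.
triples = [(10, 5, 15), (20, 18, 22), (7, 3, 14)]
hits = sum(contains_rounded_mean(t) for t in triples)

# Assumed reference rate, e.g. estimated from comparable data or a null model.
reference_rate = 0.1
p_value = binomial_upper_tail(hits, len(triples), reference_rate)
last_digit_stat = terminal_digit_chi_square([c for t in triples for c in t])
```

The reference rate matters a great deal here: it is the comparison with other scientists’ data or with a null model that gives the count of matching triples any meaning.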
The tests Pitt and Hill develop are in response to anomalies they detect in the data.
The first step in using statistical techniques to identify fabricated data is to look for anomalous patterns of data values in a given data set (or among statistical summaries presented for separate data sets), patterns that are inconsistent with those that might ordinarily appear in genuine empirical data.
This is a critical problem. It amounts to a post-hoc test of the patterns, with the consequent increased risk of false positives. Analyse any data set in enough ways and some unexpected patterns will present themselves.
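To put a rough number on this: if each of k tests has a false positive rate of alpha, and the tests are assumed to be independent (a simplification, since different ways of analysing the same data are usually correlated), the chance that at least one of them flags perfectly genuine data is 1 - (1 - alpha)^k. The function name below is hypothetical.

```python
def familywise_false_positive_rate(alpha, n_tests):
    """Chance of at least one false positive among n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests

# With alpha = 0.05, twenty independent looks at clean data give roughly a
# 64% chance of at least one "significant" anomaly.
rate = familywise_false_positive_rate(0.05, 20)
```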
There are solutions to this problem. The first is to account for the post-hoc nature of the analysis by only getting excited by anomalies with very low p-values: p = 0.05 is just not interesting, p = 10⁻⁶ is interesting, p = 10⁻⁵⁰ is becoming very interesting. Better still is to split the available data into a test set, which is explored for anomalies, and a verification set, in which only the most unexpected pattern found by the exploratory analysis is tested for. The verification set could be data from a second paper.
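The split into a test set and a verification set might look something like the sketch below. It assumes the data arrive as a list of records and that each candidate check is a function returning a p-value; `split_holdout`, `confirm_most_anomalous` and the toy checks are hypothetical names made up for illustration.

```python
import random

def split_holdout(records, seed=0, explore_fraction=0.5):
    """Randomly split records into an exploratory set and a verification set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * explore_fraction)
    return shuffled[:cut], shuffled[cut:]

def confirm_most_anomalous(records, tests, seed=0):
    """Explore with every test, then re-run only the most anomalous one on held-out data.

    `tests` maps a name to a function that takes records and returns a p-value.
    """
    explore, verify = split_holdout(records, seed)
    explored = {name: test(explore) for name, test in tests.items()}
    chosen = min(explored, key=explored.get)
    # Only this single, pre-selected test is applied to the verification set,
    # so its p-value is not inflated by the earlier exploration.
    return chosen, explored[chosen], tests[chosen](verify)

# Toy stand-ins for real checks (hypothetical; they return fixed p-values).
toy_tests = {
    "first_digit": lambda records: 0.40,
    "terminal_digit": lambda records: 0.001,
}
chosen, p_explore, p_verify = confirm_most_anomalous(list(range(40)), toy_tests)
```

When the verification set is a second paper rather than a random half of one data set, the splitting step is unnecessary and only the final, pre-selected test needs to be run on the new data.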
No statistical analysis will ever prove data falsification, still less the motivation for any malpractice. But it can give the evidence needed to initiate a thorough investigation by those with the authority to demand to see samples, lab notes and computer files and, if necessary, to attempt to replicate experiments.