Distributed Medical Statistics Gathering And Testing

(This started as a project for the Feb 24-25 2012 Hacking Health hackathon)

In the same way that the Free Software movement and Wikipedia have democratized software development and the compilation of encyclopedic knowledge, it should be possible to extend the gathering and testing of non-intrusive medical hypotheses beyond field specialists.

In a nutshell: a central server holds the data and the programs used to tally it, while participants collect observations of daily living (ODLs), that is, information about their health and behavior (not unlike the quantified self movement). Participants choose which data gathering exercises to take part in, and their privacy is protected through a cryptographic exercise described below.


For the sake of presentation, let us make some assumptions:

  • the data is a simple key/value store (more complex representations are possible)
  • the data gatherer program is a small JavaScript function that reads the participant's data store and updates its own data store
  • each data gatherer program has a unique id
  • each participant has a unique (secret) id
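
To make these assumptions concrete, here is a minimal sketch of what a data gatherer might look like. The gatherer id, the store keys (bikes_weekly, pancreatic_cancer) and the function name are hypothetical, chosen only for illustration; the text above does not specify them.

```javascript
// Hypothetical gatherer id; the text only says each gatherer has a unique id.
const GATHERER_ID = "bike-vs-pancreatic-cancer";

// A gatherer reads the participant's private key/value store and folds what
// it finds into its own key/value store. It never copies the participant's
// secret id or raw records, only an increment to one of four counters.
function bikeGatherer(participantStore, gathererStore) {
  const bikes = Number(participantStore["bikes_weekly"] || 0) > 0; // assumed key
  const cancer = participantStore["pancreatic_cancer"] === true;   // assumed key
  const cell = (bikes ? "bike" : "nobike") + "_" + (cancer ? "cancer" : "healthy");
  gathererStore[cell] = (gathererStore[cell] || 0) + 1;
  return gathererStore;
}

// Example run against one participant's store.
const participantStore = { bikes_weekly: 3, pancreatic_cancer: false };
const gathererStore = { bike_cancer: 0, bike_healthy: 0, nobike_cancer: 0, nobike_healthy: 0 };
bikeGatherer(participantStore, gathererStore);
```

In this sketch the gatherer's store is a plain contingency table; a real gatherer could keep any values that can be combined across participants.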


A person comes up with a hypothesis, say "people who bike are less likely to develop pancreatic cancer", and writes it up as a data gatherer on the central server. Once it becomes available, other participants might decide to take part in the experiment. (Better researched and better explained experiments might then get more attention.)

A willing participant then downloads the data gatherer program, with its data store initialized with uniformly distributed random numbers. After running the program, the participant passes it on to other participants, either on-line or in person using an NFC-enabled cellphone. The data won't be sent back to the main server until the data gathering agent arrives at a participant who has already taken part in that exercise.
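
The text does not spell out how the random initialization protects privacy. One possible reading is additive masking: the server remembers the random starting values, every participant adds a contribution on top, and the server subtracts the mask once the agent comes back. The sketch below follows that reading. All names (createAgent, visit, unmask, MODULUS) are hypothetical, and keeping the mask on the server is an assumption rather than something the description states.

```javascript
// Assumption: counters are kept modulo a fixed value so that a masked
// counter looks uniformly random to whoever holds the agent.
const MODULUS = 2 ** 32;

// Uniform integer in [0, MODULUS). Math.random() is NOT cryptographically
// secure; a real deployment would need a proper CSPRNG here.
function randomMask() {
  return Math.floor(Math.random() * MODULUS);
}

// Server side: build an agent whose store holds nothing but random masks.
// The mask itself stays on the server (assumption).
function createAgent(gathererId, keys) {
  const mask = {};
  const store = {};
  for (const key of keys) {
    mask[key] = randomMask();
    store[key] = mask[key];
  }
  return { mask, agent: { gathererId, store } };
}

// Participant side: returns true when this participant has already taken
// part, which is the signal to send the agent back to the server.
function visit(participant, agent) {
  if (participant.seen.has(agent.gathererId)) {
    return true;
  }
  participant.seen.add(agent.gathererId);
  for (const key of Object.keys(agent.store)) {
    const add = participant.contribution[key] || 0;
    agent.store[key] = (agent.store[key] + add) % MODULUS;
  }
  return false;
}

// Server side: subtract the mask to recover the aggregate totals only.
function unmask(agent, mask) {
  const totals = {};
  for (const key of Object.keys(mask)) {
    totals[key] = (agent.store[key] - mask[key] + MODULUS) % MODULUS;
  }
  return totals;
}

// Three participants; the agent returns to the first one, ending the round.
const { mask, agent } = createAgent("bike-study", ["bikes", "total"]);
const alice = { seen: new Set(), contribution: { bikes: 1, total: 1 } };
const bob = { seen: new Set(), contribution: { bikes: 0, total: 1 } };
const carol = { seen: new Set(), contribution: { bikes: 1, total: 1 } };
let finished = false;
for (const participant of [alice, bob, carol, alice]) {
  finished = visit(participant, agent);
}
const totals = unmask(agent, mask); // { bikes: 2, total: 3 }
```

Under these assumptions a participant who receives the agent sees only masked counters, and the server sees only the totals. The sketch does not defend against participants who collude to compare the store before and after someone else's turn.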


The approach has some known limitations and risks:

  • A data gathering exercise will need significant traction among the participants to gather useful data.
  • As the population is self-selected, the results won't be of sufficient quality for traditional research purposes (but they could inform the need for such studies down the road).
  • Data collection by the participants is quite uncertain and needs to span years to be useful.
  • While each participant is privy only to their own gathered data, keeping that data on an Internet-connected device is a security risk.
  • An adversarial participant can run a modified version of the software that destroys existing data and skews the experimental results (this will pose a real problem on hot-button topics such as the impact of vaccines).