The following goals are set:
    'taskSubmitted': lambda a, b: maximization(a, b),
    'restarted_lost': lambda a, b: minimization(a, b),
    'restarted_noact': lambda a, b: minimization(a, b),
    'excessive_delay': lambda a, b: minimization(a, b)
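
The maximization and minimization helpers referenced above are not shown in this report. A minimal sketch of what they could look like, assuming they simply compare sample means according to the goal's objective:

    # Hypothetical sketch: the actual maximization/minimization helpers are not
    # part of this report; here they are assumed to compare sample means.
    from statistics import mean

    def maximization(a, b):
        # True when the new sample `a` has a higher mean than the previous `b`.
        return mean(a) > mean(b)

    def minimization(a, b):
        # True when the new sample `a` has a lower mean than the previous `b`.
        return mean(a) < mean(b)
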
For statistical tests the following thresholds are set:
    tiny_sample_size = 15 - the t-test should be used
    small_sample_size = 30 - the z-test can be used
    good_sample_size = 40 - the sample distribution does not need to be normal
    Sources: Larson and Farber; Moore, Notz and Fligner, see "Using the t Procedures"
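
As a sketch only (one assumed reading of these thresholds, not the tool's actual implementation):

    # Hypothetical sketch of how the sample-size thresholds might be applied.
    TINY_SAMPLE_SIZE = 15    # at or below this size only the t-test should be used
    SMALL_SAMPLE_SIZE = 30   # from this size on the z-test can be used
    GOOD_SAMPLE_SIZE = 40    # from this size on normality of the data is not required

    def choose_test(n):
        # One possible reading: t-test for small samples, z-test once the
        # sample reaches small_sample_size.
        return "t-test" if n < SMALL_SAMPLE_SIZE else "z-test"

    def normality_required(n):
        # With a good sample size the test is robust to non-normal data
        # (central limit theorem), so the normality check can be skipped.
        return n < GOOD_SAMPLE_SIZE
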

Testing each goal, comparing the new version against each previous version.

Check whether the new version performs better on each goal according to that goal's optimization objective (maximization or minimization). The data of the new version and of each previous version must be normally distributed, and the two samples must show evidence of a difference, before it can be said whether the new version is better or worse than a given previous version.
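
A minimal sketch of this per-version check, assuming Shapiro-Wilk for the normality test and a two-sample t-test for the difference test (scipy.stats.shapiro and scipy.stats.ttest_ind are the real scipy functions; the surrounding decision logic is an assumption, not the tool's actual code):

    # Hypothetical sketch of a single goal/version comparison using scipy.
    from statistics import mean
    from scipy import stats

    def compare_versions(new, old, better, alpha=0.05):
        # Both samples must look normally distributed (Shapiro-Wilk test:
        # a small p-value rejects normality).
        _, p_new = stats.shapiro(new)
        if p_new < alpha:
            return "The new version does not present a normal distribution!"
        _, p_old = stats.shapiro(old)
        if p_old < alpha:
            return "Data of previous version is not normally distributed."
        # Two-sample t-test: is there evidence of any difference at all?
        _, p_diff = stats.ttest_ind(new, old)
        if p_diff >= alpha:
            return "There is no evidence of difference!"
        # There is a difference; its direction depends on the goal's objective.
        verdict = "BETTER" if better(new, old) else "WORSE"
        return (f"The new version is {verdict} ({mean(new):.1f}) "
                f"than the previous ({mean(old):.1f})")

Here better would be the goal's comparator from the list above, for example maximization for taskSubmitted or minimization for excessive_delay.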

Comparing goal: taskSubmitted...

  taskSubmitted (0.17): New version data is comparable!
  taskSubmitted (0.17 vs 0.16): There is no evidence of difference!
  taskSubmitted (0.17 vs 0.15): There is no evidence of difference!
  taskSubmitted (0.17 vs 0.14): There is no evidence of difference!
  taskSubmitted (0.17 vs 0.12.1): The new version is BETTER (7.7) than the previous (4.9)
  taskSubmitted (0.17 vs 0.13): There is no evidence of difference!
  taskSubmitted (0.17 vs 0.12): The new version is BETTER (7.7) than the previous (4.8)
  taskSubmitted (0.17 vs 0.11): There is no evidence of difference!
  taskSubmitted (0.17 vs 0.10): The new version is BETTER (7.7) than the previous (4.8)
  taskSubmitted (0.17 vs 0.9): There is no evidence of difference!
  taskSubmitted (0.17 vs 0.8): There is no evidence of difference!
  taskSubmitted (0.17 vs 0.7): There is no evidence of difference!
  taskSubmitted (0.17 vs 0.6): The new version is BETTER (7.7) than the previous (5.4)
  taskSubmitted (0.17 vs 0.5): The new version is BETTER (7.7) than the previous (2.6)
  taskSubmitted (0.17 vs 0.4): The new version is BETTER (7.7) than the previous (2.5)
  taskSubmitted (0.17 vs 0.3): The new version is BETTER (7.7) than the previous (2.3)
  taskSubmitted (0.17 vs 0.2): The new version is BETTER (7.7) than the previous (2.0)
  taskSubmitted (0.17 vs 0.1): Data of previous version is not normally distributed.

Comparing goal: restarted_lost...

  restarted_lost (0.17): The new version does not present a normal distribution!

Comparing goal: restarted_noact...

  restarted_noact (0.17): The new version does not present a normal distribution!

Comparing goal: excessive_delay...

  excessive_delay (0.17): New version data is comparable!
  excessive_delay (0.17 vs 0.16): Data of previous version is not normally distributed.
  excessive_delay (0.17 vs 0.15): Data of previous version is not normally distributed.
  excessive_delay (0.17 vs 0.14): There is no evidence of difference!
  excessive_delay (0.17 vs 0.12.1): Data of previous version is not normally distributed.
  excessive_delay (0.17 vs 0.13): Data of previous version is not normally distributed.
  excessive_delay (0.17 vs 0.12): Data of previous version is not normally distributed.
  excessive_delay (0.17 vs 0.11): Data of previous version is not normally distributed.
  excessive_delay (0.17 vs 0.10): Data of previous version is not normally distributed.
  excessive_delay (0.17 vs 0.9): The new version is WORSE (2.0) than the previous (0.0)
  excessive_delay (0.17 vs 0.8): The new version is WORSE (2.0) than the previous (0.0)
  excessive_delay (0.17 vs 0.7): The new version is WORSE (2.0) than the previous (0.0)
  excessive_delay (0.17 vs 0.6): The new version is WORSE (2.0) than the previous (0.0)
  excessive_delay (0.17 vs 0.5): The new version is WORSE (2.0) than the previous (0.0)
  excessive_delay (0.17 vs 0.4): The new version is WORSE (2.0) than the previous (0.0)
  excessive_delay (0.17 vs 0.3): The new version is WORSE (2.0) than the previous (0.0)
  excessive_delay (0.17 vs 0.2): The new version is WORSE (2.0) than the previous (0.0)
  (scipy warning during this comparison: input data for shapiro has range zero; the results may not be accurate)
  excessive_delay (0.17 vs 0.1): The new version is WORSE (2.0) than the previous (0.0)

Developed by Cleber Jorge Amaral.

Sources:

  • Shapiro-Wilk test
  • David S. Moore, William I. Notz, Michael A. Fligner. The Basic Practice of Statistics. 6th Edition. W. H. Freeman (2011)
  • Ron Larson and Betsy Farber. Elementary Statistics: Picturing the World. 5th Edition. Prentice Hall (2012)