-------------------------------------------------
From Walter Kremers 2023-04-19

Thank you for your response and insights. I share your experience in medical research that deviation from linearity is a more frequent concern than product or interaction terms for quantitative predictors. Any analysis should begin with consideration of transformations to approximate linearity, informed by previous knowledge and the data. My discussion is not meant to discount this consideration but to address how linear models and machine learning tools compare after this step has been taken.

One attractive characteristic of artificial neural networks (ANNs) is that, being in essence splines, they can account for nonlinearities. Done blindly, though, any model will do less well than when a directed approach like the one you describe is used. ANNs are also good at picking up interactions between categorical predictors, the kind of interaction more often described. (Interestingly, neither gradient boosting machines (GBMs) nor ANNs are as good as one might hope at picking up product terms in quantitative predictors. To do this well they still need a fair amount of data.)

We have used ANNs for multiple studies involving image data, e.g. pathology slides and radiographs. Here the ANNs have been a valuable tool. As is “well known”, ANNs do well where there is a high signal-to-noise ratio despite complexities in patterns. When we can identify something by eye, even if we cannot think of how to describe it in a “numerical model”, the ANNs are often good at teasing it out.

For “typical” tabular data involving clinical variables the patterns are often difficult to discern, and ANNs may not perform as well as classical methods. This, though, is largely determined by the amount of information in the data, as reflected in part by sample size. For small data, classical statistical methods generally perform better and are easier to relate to biological mechanisms than ANNs. As already mentioned, with large data ANNs can perform very well. My question then is: where is the middle ground where ANNs start to compete with or outperform classical methods?

To better understand how different models perform in medical practice I put together a program that runs multiple models on an input dataset and then compares performances, similar to your presentation in the “Scientific Reports” article. I run this program on datasets I encounter where machine learning methods “might” be of value. I can simulate datasets for which one model or the other performs better, but I want to know how performances compare for data I am likely to encounter. In this program I have a module for ANNs.* To make meaningful comparisons between classical and machine learning methods I also wanted to include “reasonably” fit ANNs. I don’t want to discount ANNs just because of a poor numerical routine. That would be amiss and bad research.
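As a minimal sketch of the basic comparison idea (hypothetical names and a deliberately simplified setup, not the actual program), one can fit an unpenalized GLM and a lasso on the same cross-validation folds and compare out-of-fold deviance:

library(glmnet)

## Sketch only: compares an unpenalized logistic GLM with a lasso
## on shared cross-validation folds. Assumes a 0/1 outcome in df[[yname]].
compare_glm_lasso <- function(df, yname, nfolds = 10) {
  y <- df[[yname]]
  xvars <- setdiff(names(df), yname)
  x <- model.matrix(reformulate(xvars), df)[, -1]
  foldid <- sample(rep(seq_len(nfolds), length.out = nrow(df)))

  ## lasso: cv.glmnet does the cross-validation internally
  cv_lasso <- cv.glmnet(x, y, family = "binomial", foldid = foldid)

  ## unpenalized GLM: collect out-of-fold predicted probabilities
  p_glm <- rep(NA_real_, nrow(df))
  form <- reformulate(xvars, response = yname)
  for (k in seq_len(nfolds)) {
    fit <- glm(form, family = binomial(), data = df[foldid != k, , drop = FALSE])
    p_glm[foldid == k] <- predict(fit, df[foldid == k, , drop = FALSE],
                                  type = "response")
  }
  p_glm <- pmin(pmax(p_glm, 1e-8), 1 - 1e-8)  # guard the log()

  ## mean out-of-fold binomial deviance, lower is better
  c(glm   = -2 * mean(y * log(p_glm) + (1 - y) * log(1 - p_glm)),
    lasso = min(cv_lasso$cvm))
}

The actual program covers more model classes (including the ANN module) and other outcome families; this sketch only shows the shared-folds comparison idea.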
Not wanting to discount ANNs because of a poor fit is why I consider the nested model structure when fitting them; doing so noticeably improves the ANN fits. Basically, for “medium to large” datasets I can only comfortably tell my collaborators that an ANN will not be of value for a study when this is supported by the data, as indicated by a reasonably fit ANN model.

Yes, as you mention, calibration is also an issue and should be considered as part of any model-building process. In general, ANNs can calibrate well depending on how one selects a model. If one tunes on the number of iterations, starting weights, and model structure, they do well. If one tunes on L1 or L2 penalties, I expect they will have weaknesses similar to those of lasso and ridge regression in the classical setting.

* Here too I have made this “off the shelf” in that all code needed to start a fit at the (possibly recalibrated) lasso model is internal, and no “hands on” processing is required.

The program for model comparison is written as an R function and is therefore easily accessible to a broad community. To run the program the user need only provide the data and specify the model foundation, e.g. “cox”, “binomial” or “gaussian”. The ANNs are fit using the “torch” package, and the code can serve as a gentle introduction to ANN programming for those wanting to work within the R environment. R “torch” is very flexible, being built on the same underlying library (libtorch) as PyTorch. It also makes use of the multiple cores on a laptop or desktop and so runs relatively fast.
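As a rough sketch of what “starting the fit at the lasso model” can look like in R “torch” (hypothetical names and a deliberately simplified construction, assuming a one-hidden-layer net with a linear skip connection that nests the linear model; the actual program differs):

library(torch)
library(glmnet)

## A net whose skip connection is exactly a linear model, so the
## linear model is nested within the ANN.
nested_net <- nn_module(
  initialize = function(p, hidden = 8) {
    self$hid  <- nn_linear(p, hidden)
    self$out  <- nn_linear(hidden, 1)
    self$skip <- nn_linear(p, 1)   # the nested linear-model part
  },
  forward = function(x) {
    self$skip(x) + self$out(nnf_relu(self$hid(x)))
  }
)

## Initialize the net so it reproduces the cross-validated lasso fit.
start_at_lasso <- function(x, y, hidden = 8) {
  cvfit <- cv.glmnet(x, y, family = "binomial")
  beta  <- as.numeric(as.matrix(coef(cvfit, s = "lambda.min")))  # intercept, slopes

  net <- nested_net(ncol(x), hidden)
  with_no_grad({
    net$skip$weight$copy_(torch_tensor(matrix(beta[-1], nrow = 1)))
    net$skip$bias$copy_(torch_tensor(beta[1]))
    net$out$weight$mul_(0.01)  # hidden path starts near zero
    net$out$bias$zero_()
  })
  net  # gradient-descent training (e.g. with optim_adam) proceeds from here
}

At initialization the hidden path contributes essentially nothing, so the net begins at the lasso predictor, the idea being that training moves away from the lasso fit only as the data warrant; tuning on the number of iterations and starting weights, as described above, then applies directly.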