Abstract: |
Probability models have recently been preferred over regression models in contamination evaluation, but a proper performance comparison between the two model types is lacking. Linear regression, logistic regression, XGBoost-based regression, and probability models were built from soil arsenic and selected soil physicochemical properties of 287 samples to predict arsenic in rice grains. The outputs of all models were uniformly converted to binary classes for comparison. The models based on complex algorithms, namely XGBoost-based regression (R² = 0.046 ± 0.036) and probability models (cross-entropy = 0.697 ± 0.020), did not surpass the simple linear regression (R² = 0.046 ± 0.031) and logistic regression models (cross-entropy = 0.694 ± 0.021). Accuracy, sensitivity, specificity, precision, and F1 score showed that the probability models had no advantage over the regression models, although these indicators did not serve as proper scoring rules for the probability models. When the contaminant concentration in grains was discretized for probabilistic modeling, the limit concentration, rather than the structure of the dataset, was used as the splitting point, which would reduce the inherent advantage of the probability model. For predicting crop contamination, the probability model cannot replace the regression model, and models based on simple but robust algorithms are preferred when the quality and quantity of the dataset are inadequate. |