China Animal Husbandry and Veterinary Medicine ›› 2024, Vol. 51 ›› Issue (9): 4060-4065. doi: 10.16431/j.cnki.1671-7236.2024.09.034

• Preventive Veterinary Medicine •

Construction of Machine Learning Models to Predict the Cross-host Infection Risk of Diarrheagenic Escherichia coli Based on CRISPR Sequence

FENG Xinyuan, ZHAO Jiaxue, LONG Jinzhao, HU Jingyan, XI Yanyan, CHEN Shuaiyin, YANG Haiyan, DUAN Guangcai   

  1. College of Public Health, Zhengzhou University, Zhengzhou 450016, China
  • Received: 2024-01-15 Published: 2024-08-27

Abstract: 【Objective】 This study aimed to predict the cross-host infection risk of diarrheagenic Escherichia coli and to identify zoonotic isolates from CRISPR sequences using machine learning. 【Method】 Genome sequences of 806 diarrheagenic Escherichia coli strains isolated in China were obtained from the Enterobase database. Features were constructed from the spacer sequences of the CRISPR loci. Machine learning models were then built, and their performance was evaluated by 10-fold cross-validation. The zoonotic risk of each isolate was estimated with the best-fitting model, and the zoonotic risk potential was compared across animal sources. 【Result】 A total of 1 093 spacer sequence clusters were obtained from the 806 isolates: 196 unique to human isolates, 291 unique to animal isolates, and 606 shared between human and animal isolates. Linear discriminant analysis showed significant differences in the distribution of spacer sequence clusters between human and animal strains. Random forest, logistic regression, support vector machine and gradient boosting decision tree models were then established; all of them predicted the isolate source with an accuracy >0.82 and an area under the receiver operating characteristic curve (AUC) close to 0.9. After optimization, the random forest model performed best, with an accuracy of 0.844 and an AUC of 0.915. According to the infection risk assigned to each isolate by the best model, swine isolates showed the highest risk of infecting humans, ovine isolates showed a low risk, and only a few poultry isolates showed the potential to infect humans.
【Conclusion】 The machine learning model based on spacer sequences could identify isolates with zoonotic potential, providing new insights for the prevention and control of infectious diseases.
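The feature-construction step can be illustrated with a minimal sketch. The abstract does not state how spacer clusters were encoded, so the binary presence/absence matrix below is an assumption, and the isolate IDs, cluster IDs and function names are hypothetical toy values rather than data from the study. The sketch also shows how clusters split into human-only, animal-only and shared groups, the same kind of partition as the reported 196 + 291 + 606 = 1 093 clusters.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy data: isolate ID -> (host label, spacer cluster IDs it carries).
ISOLATES: Dict[str, Tuple[str, Set[str]]] = {
    "iso1": ("human", {"c1", "c2"}),
    "iso2": ("human", {"c2", "c3"}),
    "iso3": ("animal", {"c3", "c4"}),
    "iso4": ("animal", {"c4", "c5"}),
}


def partition_clusters(
    isolates: Dict[str, Tuple[str, Set[str]]]
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Split clusters into (human-only, animal-only, shared)."""
    human: Set[str] = set()
    animal: Set[str] = set()
    for host, clusters in isolates.values():
        (human if host == "human" else animal).update(clusters)
    return human - animal, animal - human, human & animal


def presence_matrix(
    isolates: Dict[str, Tuple[str, Set[str]]]
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Build a 0/1 matrix: one row per isolate, one column per cluster (assumed encoding)."""
    columns = sorted(set().union(*(c for _, c in isolates.values())))
    ids = sorted(isolates)
    rows = [[1 if col in isolates[i][1] else 0 for col in columns] for i in ids]
    return ids, columns, rows


human_only, animal_only, shared = partition_clusters(ISOLATES)
ids, columns, matrix = presence_matrix(ISOLATES)
```

A matrix of this shape, with the host label as the target, is the kind of input that classifiers such as a random forest could be trained on under 10-fold cross-validation.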

Key words: spacer sequences; machine learning; diarrheagenic Escherichia coli; cross-host infection risk prediction
