One-Year Predictions of Delayed Reward Discounting in the Adolescent Brain Cognitive Development Study
Abstract
Delayed reward discounting (DRD) refers to the extent to which an individual devalues a reward based on a temporal delay and is known to be elevated in individuals with substance use disorders and many mental illnesses. DRD has been linked previously with both features of brain structure and function, as well as various behavioral, psychological, and life history factors. However, there has been little work on the neurobiological and behavioral antecedents of DRD in childhood. This is an important question, as understanding the antecedents of DRD can point to mechanisms in the development of psychopathology. The current study used baseline data from the Adolescent Brain Cognitive Development Study (N = 4042) to build machine learning models to predict DRD at the first follow-up visit, one year later. In separate machine learning models, we tested elastic net regression, random forest regression, light gradient boosting regression, and support vector regression. In five-fold cross-validation on the training set, models using an array of questionnaire/task variables were able to predict DRD, with these findings generalizing to a held-out (i.e., “lockbox”) test set of 20% of the sample. Key predictive variables were neuropsychological test performance at baseline, socioeconomic status, screen media activity, psychopathology, parenting, and personality. However, models using MRI-derived brain variables did not reliably predict DRD in either the cross-validation or held-out test set. These results suggest a combination of questionnaire/task variables as antecedents of excessive DRD in late childhood, which may presage development of problematic substance use in adolescence.
Public Health Relevance: Steep discounting of delayed rewards is a factor in many behavioral problems and psychiatric disorders. The current study demonstrates that steepness of discounting can be reliably predicted one year in advance in 10/11-year-old children using a machine learning approach with an array of questionnaire and task data. Specific variables predictive of discounting include sociodemographic factors, cognitive ability, developmental history, screen media activity, impulsive personality traits, and social activities.
Introduction
The propensity to favor immediately available rewards over those available in the future has been termed delayed reward discounting (DRD). Excessive DRD has been consistently linked to numerous adverse outcomes, including substance use disorders (Bickel & Johnson, 2003; MacKillop et al., 2011) and obesity (Amlung, Petker, Jackson, Balodis, & MacKillop, 2016). Furthermore, there is evidence that excessive DRD is predictive of future development of substance use disorders (Audrain-McGovern et al., 2009; Fernie et al., 2013) and poor prognosis in substance use disorder treatment (MacKillop & Kahler, 2009; Sheffer et al., 2014; Stanger et al., 2012; Syan, Gonzalez-Roz, Amlung, Sweet, & MacKillop, 2021). DRD is also excessive in numerous psychiatric disorders, such as bipolar disorder, borderline personality disorder, schizophrenia, attention deficit hyperactivity disorder, bulimia nervosa, binge eating disorder, and major depressive disorder (Amlung et al., 2019; Bickel et al., 2019). Because of its association with these disorders, DRD has been described as a transdiagnostic mechanism of psychopathology (Amlung et al., 2019), although there has also been criticism of the task's lack of specificity to any one diagnosis and of the distinctiveness of DRD's relationship to these disorders (Bailey, Romeu, & Finn, 2021; Lempert, Steinglass, Pinto, Kable, & Simpson, 2019).
There are also numerous behavioral factors that have been linked to excessive DRD. Although DRD is considered a distinct form of impulsivity, DRD has been found to have small associations with impulsive personality traits such as negative urgency, lack of planning, and sensation seeking (MacKillop et al., 2016). DRD has also been linked to neuropsychological abilities, such as working memory (Bickel, Yi, Landes, Hill, & Baxter, 2011; Wesley et al., 2014) and overall cognitive function (Shamosh & Gray, 2008), as well as obvious products of these functions such as academic performance (Kirby, Winston, & Santiesteban, 2005). Another factor that has been linked to DRD is stress. This is true of acute stress (Fields, Lange, Ramos, Thamotharan, & Rassu, 2014), early life stress (Acheson, Vincent, Cohoon, & Lovallo, 2019), and trauma (Van Den Berk-Clark, Myerson, Green, & Grucza, 2018). Socioeconomic status, a factor directly tied to stress levels, has also been linked consistently with DRD levels (Green, Myerson, Lichtman, Rosen, & Fry, 1996; Hampton, Asadi, & Olson, 2018; Reimers, Maylor, Stewart, & Chater, 2009). Other demographic factors are also associated with DRD, with mixed evidence regarding sex differences in DRD and DRD decreasing with age (Doidge, Flora, & Toplak, 2021; Reimers et al., 2009; Steinberg et al., 2009). Furthermore, health behaviors such as exercise (Sofis, Carrillo, & Jarmolowicz, 2017) and unhealthy eating (Barlow, Reeves, McKee, Galea, & Stuckler, 2016) have been linked to DRD.
The neurobiological basis of excessive DRD has also been investigated. Four brain networks have been repeatedly shown to be activated by functional magnetic resonance imaging (FMRI) DRD tasks: the executive control network, the default mode network, the salience network, and the reward network (Carter, Meyer, & Huettel, 2010; Wesley & Bickel, 2014; Yeo et al., 2015). Individuals with substance use disorders have been found to have more activation in these networks during choices for immediate rewards (relative to baseline and to delayed rewards) and reduced activation during choices for delayed rewards (relative to baseline and to immediate rewards), suggesting that excessive DRD is characterized by differences in the activation of these brain networks (Owens et al., 2019). Gray matter structure in these networks has also been linked to DRD, with lower gray matter volume in the default mode network being linked to higher DRD in adults (Owens et al., 2017) and lower gray matter volume in the reward network linked to higher DRD in adolescents (Mackey et al., 2017). Likewise, white matter structure has also been linked to DRD, with multiple studies showing higher DRD being linked to lower white matter structural integrity connecting the prefrontal cortex to other areas of the brain (Olson et al., 2009; Peper et al., 2013). Activation in these networks during other FMRI tasks has also been linked to DRD. For example, higher activation during reward anticipation tasks, such as the monetary incentive delay (MID) task, in the reward network has been linked to lower DRD (Benningfield et al., 2014). Likewise, activation in the executive control network (especially the dorsolateral prefrontal cortex) during working memory tasks, such as the N-Back, has been linked both to DRD and to activation in the dorsolateral prefrontal cortex during FMRI DRD tasks (Wesley & Bickel, 2014).
Furthermore, there are studies linking regional functional connectivity during resting-state FMRI to DRD, particularly connectivity of the executive control, reward, and salience networks (Contreras-Rodriguez et al., 2015; Costa Dias et al., 2013; Zhu, Cortes, Mathur, Tomasi, & Momenan, 2017).
Despite the many factors individually associated with DRD, it is unclear which factors are most important in the development of excessive DRD. This can be addressed using predictive modeling approaches that are increasingly being used in behavioral research (Yarkoni & Westfall, 2017). By predicting future DRD, it may be possible to identify potential mechanisms that contribute to the development of excessive DRD. Furthermore, predictive modeling with cross-validation can assess overfitting better than traditional explanatory analyses and ultimately lead to more generalizable findings. Another benefit of predictive modeling is the ability to simultaneously investigate many variables, including multi-modal variables, as predictors of a target. This is a particularly appealing benefit for understanding which of its correlates are uniquely important to DRD, as many of the factors linked to DRD are inter-correlated, making it difficult to disentangle the meaning of their relationships. Notably, these approaches can be used to determine if brain and behavioral variables provide unique contributions to DRD or if these factors are describing the same processes at different levels of analysis. Despite its promise, no studies, to our knowledge, have used a machine learning approach to attempt to predict DRD in the future over the course of development.
The current study aimed to predict excessive DRD one year in advance using data from the Adolescent Brain Cognitive Development℠ Study (ABCD Study®) and a suite of machine learning approaches. The ABCD Study is an ongoing multi-site, longitudinal neuroimaging study following a cohort of 11,880 youths over ten years, which provides an opportunity to study DRD in a very large sample of 9-to-10-year-old children. This is a valuable age group in which to study DRD, as pre-adolescence represents the age at which impulsivity (Harden & Tucker-Drob, 2011) and DRD rates are thought to peak (Scheres, Tontsch, Thoeny, & Sumiya, 2014; Steinberg et al., 2009). It is also the age that immediately precedes the period in which substance use is typically initiated (SAMHSA, 2013). Because of the strong association of excessive DRD and disordered substance use, understanding factors that precede excessive DRD in this age group may also help to identify factors that precede the development of early and problematic patterns of substance use. In the current study, we attempted to predict DRD one year in the future, using a large array of MRI, questionnaire, and task measures. We used a nested cross-validation framework plus an external/held-out test set (which has been termed a “lockbox”) to test a variety of machine learning algorithms and approaches for building optimized models to predict DRD using MRI and questionnaire/task factors.
Methods
Procedures
The ABCD Study® is a 10-year longitudinal investigation of cognitive development in children at 21 sites across the United States (Volkow et al., 2018). 11,880 children were enrolled in the ABCD Study® at the ages of 9-and-10-years-old (between 2016–2018). Assessments are conducted in the laboratory each year while MRI is assessed every two years. Data presented in this paper include structural MRI (SMRI), diffusion MRI (DMRI), and FMRI, as well as questionnaire and computerized neuropsychological task data from the baseline visit and DRD task data from the one-year follow-up. All baseline data used in the current study (i.e., MRI, questionnaire, neuropsychological tasks) were collected at a single visit or across two visits that occurred within 30 days of each other. Data were retrieved from ABCD data release 2.0.1, which included 11,875 participants' data at baseline, of which 4951 participants also had data at follow-up. The ABCD study was approved by the institutional review board of the University of California, San Diego (Institutional Review Board# 160091). Additionally, the institutional review boards of each of the 21 data collection sites approved the study. Informed consent was obtained from all parents and informed assent was obtained from participants. Data can be accessed through registration with the ABCD study at https://nda.nih.gov/abcd. We report how we determined our sample size, all data exclusions, all manipulations, and all measures in the study. This study was not pre-registered.
Participants
Exclusion criteria for the ABCD Study included MRI contraindications, like metal implants, lack of English fluency, history of major neurological disorders, premature birth (i.e., under 28 weeks), and hospitalization at birth greater than 30 days (Casey et al., 2018; Garavan et al., 2018). While the ABCD Data Analysis, Informatics & Resource Center (DAIRC) creates several indices of data quality, all exclusions from the full sample of 4,951 participants with data at baseline and follow-up were done by the research team of the current study starting from the total number of participants enrolled in the ABCD study. Participants were excluded from all analyses if they had incomplete or invalid data for the DRD task (i.e., three inconsistent points of indifference; see Measures section). This resulted in the exclusion of 782 participants for invalid data and of 127 participants for missing data. Notably, the 782 participants with invalid data differed significantly on mean indifference point from the 4042 participants with acceptable DRD data (t = 10.66, p = 2.95E-25). These participants also differed (p < .05 on t-test or chi-square test) on 49% of questionnaire/task predictor variables and on 23% of MRI predictor variables. Participants were not excluded for missing data on the questionnaire/task variables or MRI predictor variables.
In SMRI quality control, all SMRI data were visually examined by a trained ABCD technician, who rated them from zero to three on five dimensions: motion, intensity homogeneity, white matter underestimation, pial overestimation, and magnetic susceptibility artifact. From this, an overall score was generated recommending inclusion or exclusion (Hagler et al., 2019). All subjects recommended for exclusion based on their SMRI data were excluded from all MRI modalities because the preprocessed SMRI data were required for the processing pipelines of the other modalities. Likewise, any subjects missing SMRI data were excluded from all modalities. This resulted in the exclusion of 452 participants.
Additionally, participants were excluded from the sample of each imaging modality for specific criteria related to that modality. For each of the three task FMRI paradigms, participants were excluded for having any missing FMRI data on that task, having fewer than two FMRI scans pass the image quality control performed by the DAIRC (which was similar to the DAIRC Freesurfer QC reported above), or failing to meet additional quality control criteria specific to this report. These additional quality control steps excluded participants having: 1) hemispheric mean beta-weights more than two standard deviations from the sample mean, 2) fewer than 200 degrees of freedom over the two runs, 3) mean framewise displacement > 0.9 mm for both runs, or 4) failure to meet task-specific performance criteria (described in Casey et al., 2018). Because of a data processing error (https://github.com/ABCD-STUDY/fMRI-cleanup), participants whose data were collected on Philips scanners were excluded for all FMRI tasks. Additionally, for the stop signal task (SST) only, a small group of participants were excluded because of a glitch in the SST task (when the stop signal delay is 50 msec, a response that is faster than 50 msec is erroneously recorded as the response for all subsequent Stop trials, see (Garavan et al., 2020)). For resting-state FMRI (RSFMRI), we excluded any subject without one or more runs that passed the quality control inspection conducted by the DAIRC. In addition, for RSFMRI we excluded subjects with less than 375 scanner volumes (i.e., TRs). Furthermore, we excluded any subjects collected on Philips scanners, as these were also affected by the processing error reported above. Likewise, participants were excluded from the DMRI portion of analyses if they did not have at least one DMRI scan that passed the visual and automated quality control assessment conducted by the DAIRC.
These exclusion criteria resulted in participant totals of 4042 overall, 3590 for SMRI, 3444 for DMRI, 2681 for RSFMRI, 1945 for FMRI SST, 2045 for FMRI N-Back, and 2263 for the FMRI MID task. See Table 1 for demographic information for these various samples. Additionally, there were 1325 participants with valid data in all six MRI paradigms (referenced below in “multimodal” analyses).
Measures
Delayed Reward Discounting Task (Target)
A DRD task was administered in the ABCD study at the year one follow-up, approximately one year after all neuroimaging and behavioral data were collected. The task used was a 42-item adjusting amount DRD procedure based on the one tested by Koffarnus and Bickel (Koffarnus & Bickel, 2014; Luciana et al., 2018). In this procedure, participants made repeated choices between a smaller immediate reward and a larger delayed reward. The amounts of the larger delayed rewards were held constant at a fixed dollar amount ($100) and the smaller immediate rewards were increased or decreased depending on participants' responding. This was done to identify indifference points at which participants are ambivalent between the immediate and delayed rewards. Seven delay periods were tested with six trials per delay period. Indifference points were identified for the seven delay periods – “6 hours”, “1 day”, “1 week”, “1 month”, “3 months”, “1 year”, and “5 years”. The user manual for the procedure used in ABCD can be found at https://www.millisecond.com/download/library/v6/delaydiscountingtask/.
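The trial-to-trial adjustment can be sketched as a simple staircase. The following is an illustrative simplification rather than the Millisecond implementation: the starting offer (half the delayed amount), the halving step size, and all names are assumptions made for the example.

```python
def run_adjusting_amount(choose_immediate, delayed=100.0, n_trials=6):
    """Estimate one indifference point for a single delay period.

    choose_immediate: callable that receives the current immediate offer
    and returns True if the (simulated) participant picks it. The offer
    starts at half the delayed amount; each choice moves it down (when
    the immediate reward is chosen) or up (when the delayed reward is
    chosen) by a step that halves on every trial.
    """
    immediate = delayed / 2.0
    step = delayed / 4.0
    for _ in range(n_trials):
        if choose_immediate(immediate):
            immediate -= step  # immediate preferred: offer less next time
        else:
            immediate += step  # delayed preferred: offer more next time
        step /= 2.0
    return immediate  # converges toward the indifference point
```

Under these assumptions, six trials per delay bracket the indifference point to within roughly 1% of the delayed amount.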
Overall, patterns of responding indicated that participants understood the task and were able to perceive differences between delay periods. Average discounting rates followed the expected pattern of hyperbolic discounting (Figure 1) (Green & Myerson, 2004; Green, Myerson, Shah, Estle, & Holt, 2007). This is consistent with prior literature, which suggests that pre-adolescent children are able to comprehend DRD tasks and demonstrate similar patterns of hyperbolic DRD as adults (Burns et al., 2020). Validity of individual DRD task performance was quantified by identifying inconsistencies in a participant's responding. Inconsistent responding was defined as having an indifference point for a given delay that was larger than the indifference point for a shorter delay – for example, a larger indifference point for a 1-month delay than for a 1-week delay would be an inconsistency because indifference points should always decrease as the delay period increases. In analyses of inconsistencies, 21% of participants demonstrated zero inconsistencies, 33% of participants demonstrated one inconsistency, 30% demonstrated two inconsistencies, 14% of participants demonstrated three inconsistencies, and 2% of participants demonstrated four inconsistencies. In our primary analysis, we excluded all participants with three or more inconsistencies. As a supplementary analysis, we also repeated our primary analyses three times with one, two, and four inconsistencies as our threshold to confirm that this thresholding decision was not biasing results.
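The inconsistency count can be sketched as below, assuming comparisons between adjacent delays only; the function name and input format are hypothetical.

```python
def count_inconsistencies(indifference_points):
    """Count violations of monotonic decrease across adjacent delays.

    indifference_points: values ordered from shortest to longest delay.
    A violation is an indifference point that exceeds the point for the
    next-shorter delay, since values should fall as the delay grows.
    """
    return sum(
        later > earlier
        for earlier, later in zip(indifference_points, indifference_points[1:])
    )
```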
We examined two measures of delayed reward discounting: mean point of indifference and area under the curve. We elected not to examine k values as a measure of DRD, as some subjects in the ABCD sample did not show hyperbolic discounting, which is necessary to calculate a valid k value, and because the hyperbolic nature of DRD is not as well established in children. For mean point of indifference, we simply took the mean of the seven indifference points provided in the ABCD data release. For area under the curve, we used the approach outlined in Myerson, Green, & Warusawitharana (2001), which involves scaling the change between indifference points by the amount of time between them. In our full sample (N = 4042), mean indifference and area under the curve had a Pearson's correlation of .96, p < .001. We conducted separate predictive modeling analyses using each measure as the target, finding similar results for each using MRI predictors. However, when using questionnaire/task data as predictors, mean indifference was better predicted in the five-fold cross-validation. As a result, before conducting tests on our lockbox test set, we opted to make mean indifference our primary DRD outcome measure (reported in main text) and area under the curve a secondary outcome measure (reported in Supplemental Materials). Additionally, mean indifference was used in all other Supplementary analyses, in which we look for approaches that would better predict DRD using MRI predictors.
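The trapezoidal area-under-the-curve computation of Myerson, Green, & Warusawitharana (2001) can be sketched as follows. Whether a zero-delay anchor point is prepended varies across implementations; this sketch uses only the measured indifference points, and the function name is an assumption.

```python
def discounting_auc(delays, indifference_points, amount=100.0):
    """Normalized area under the discounting curve.

    Delays are scaled by the maximum delay and indifference points by
    the delayed amount, so both axes run over [0, 1]. Each segment
    between successive delays contributes a trapezoid of area
    (x2 - x1) * (y1 + y2) / 2, and AUC is their sum. Values near 1 mean
    little discounting; values near 0 mean steep discounting.
    """
    max_delay = max(delays)
    xs = [d / max_delay for d in delays]
    ys = [p / amount for p in indifference_points]
    return sum(
        (x2 - x1) * (y1 + y2) / 2.0
        for (x1, y1), (x2, y2) in zip(zip(xs, ys), zip(xs[1:], ys[1:]))
    )
```

Because the trapezoids rescale each segment by the time between delays, steep early discounting and steep late discounting contribute to the total in proportion to the delay range they span.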
Questionnaire/Task Predictors
A total of 131 questionnaire/task variables were used as predictors in all questionnaire/task models described. These included cognitive tasks and self-report questionnaires completed by participants and their parents/guardians. Measures contributing variables to the final questionnaire/task predictor elastic net model are described here and all measures are described in detail in Supplemental Methods. Measures described in the Supplemental Methods include the Edinburgh Handedness Inventory (Veale, 2014), the Wechsler Intelligence Scale for Children-V (Wechsler, 2014), the Prodromal Psychosis Scale (Karcher et al., 2018), the Sleep Disturbance Scale (Romeo et al., 2013), the BIS/BAS (Bijttebier, Beck, Claes, & Vandereycken, 2009), a medical history questionnaire (Todd, Joyner, Heath, Neuman, & Reich, 2003), the Family History Assessment Module Screener (Rice et al., 1995), the Child Report of Behavior Inventory (Barber, Olsen, & Shagle, 1994), the Achenbach Adult Self-Report (Achenbach & Rescorla, 2003), and the Family Environment Scale (Moos & Moos, 1986).
Demographic Questionnaire
A demographics questionnaire was also administered to each child's parent/guardian to determine demographic information including the child's sex, age, household income, parental education, and parental marital status.
Pubertal Development Scale
Pubertal status was assessed using the pubertal development scale (Petersen, Crockett, Richards, & Boxer, 1988), which was completed by a parent/guardian and by the participant, with results of the two being averaged. This measure has been shown to have good reliability and to correspond with accepted self-report and biological measures of pubertal development (Petersen et al., 1988).
NIH Toolbox
Assessments were selected from the National Institutes of Health (NIH) Toolbox for evaluating neurological and behavioral function. The toolbox consists of several different tasks including 1) Picture Vocabulary Test, 2) Flanker Inhibitory Control and Attention Test, 3) Dimensional Change Card Sort Test, 4) List Sorting Working Memory Test, 5) Pattern Comparison Processing Speed Test, and 6) Oral Reading Recognition Test. These tests are designed, respectively, to assess 1) language, 2) inhibitory control and attention, 3) executive function, 4) working memory, 5) processing speed, and 6) reading ability (Gershon et al., 2013; Hodes, Insel, Landis, & NIH Blueprint for Neuroscience Research, 2013; Luciana et al., 2018). The tasks were administered by trained research staff using an iPad. In addition to specific task scores, the NIH Toolbox produces Fluid and Crystallized Composite Scores, which were used in the current study along with all individual task scores.
Child Behavior Checklist
The Child Behavior Checklist (CBCL) is a parent-report measure of emotional and behavioral symptoms in children aged 6-18 (Achenbach, 2007). This measure has been shown to have good validity and reliability for its syndrome scales (Dopfner, Schmeck, Berner, Lehmkuhl, & Poustka, 1994). In the current study, the empirically based syndrome scales used were the Attention Problems Scale, Thought Problems Scale, Externalizing Composite (and its three subscales), and Internalizing Composite (and its three subscales).
The Kiddie Schedule for Affective Disorders and Schizophrenia
The Kiddie Schedule for Affective Disorders and Schizophrenia (K-SADS) is an interview for the diagnosis of psychiatric disorders in children aged 6-18. It was administered to children by trained ABCD research staff and was self-administered on an iPad by parents. The K-SADS has been shown to be both valid and reliable for the diagnosis of psychiatric disorders, aligning well with other established diagnostic protocols (Kaufman et al., 1997). In the current study, we utilized data on psychiatric disorder diagnoses, traumatic experiences in the child's life, and background information about the child (e.g., grades in school, conflicts in school).
Abbreviated Youth Version of the UPPS-P Impulsive Behavior Scale
The version of the UPPS-P used (Watts, Smith, Barch, & Sher, 2020) assesses the five facets of impulsivity: Negative Urgency, Positive Urgency, Premeditation, Perseverance, and Sensation Seeking. The measure has 20 items, four for each facet of impulsivity, rated on a 1 (agree strongly) to 4 (disagree strongly) scale. The scale is generally similar to the adult abbreviated version, but with several items altered to be more appropriate for children. This scale demonstrates the same five-factor structure as the adult version and shows good convergent and discriminant validity with relevant personality, psychopathology, and neurocognitive measures (Watts et al., 2020).
Developmental History
A developmental history questionnaire was developed by the Adolescent Component of the National Comorbidity Survey (Kessler et al., 2009) with supplemental questions on maternal use of substances during pregnancy. The questionnaire contains information about maternal prenatal care, maternal substance use during pregnancy (including caffeine and tobacco), prenatal maternal health conditions (e.g., gestational diabetes), prematurity, birth complications, and developmental milestones. From this measure, we used scales of maternal drug use while pregnant and scales assessing 10 aspects of pregnancy/birth/development. We also created two summary scales of specific maternal problems during pregnancy and of specific complications during birth.
Substance Use – Caffeine, availability.
In the ABCD study, recent caffeine consumption was assessed via modified Supplemental Beverage Questions (Lisdahl et al., 2018). In this assessment, the youth was asked the typical number of caffeinated drinks they consumed in the past 6 months, from categories of coffee, espresso, tea with caffeine, soda with caffeine, and energy drinks with typical serving sizes.
Sports and Activities Scale
The Sports and Activities Involvement Questionnaire (SAIQ) was a parent-report measure capturing youth participation in specific sports and activities (Barch et al., 2018). From this, we derived four sum scales for team sports (e.g., basketball), individual sports (e.g., tennis), performance sports (e.g., dance), and hobbies (e.g., coin collecting).
Resilience Scale (Friendships)
A child-report survey (termed the ABCD Other Resilience Scale) was administered measuring the number and type of youths' friendships. The current study used scales counting the number of participants' close friendships of the same and opposite sex.
Screen Time
Screen time was measured as the weekly average hours a child typically spends on a computer, cellphone, tablet, or other electronic device (Paulus et al., 2019). This included: watching TV shows or movies, watching videos, playing video games, texting or chatting, visiting social network sites, and video chatting. In the current study, we used a youth-report and a parent-report of average screen time on weekends and weekdays.
Prosocial Scale (Youth and Parent)
Parents and youth both rated the level of prosocial behavior of the youth using the Prosocial Behavior Scale from the Strengths and Difficulties Questionnaire (Goodman, 2001). From this 3-item scale assessing helping, sharing, and comforting, we derived a total score from youth-report and a separate scale from parent-report, both of which were included.
Parental Monitoring Questionnaire
The Parental Monitoring Scale is a 5-item scale completed by youths that assesses the degree to which their parents are generally aware of their whereabouts (Zucker et al., 2018).
MRI Predictors
Since there is prior evidence and there are theoretical reasons supporting a connection between DRD and each of the six MRI paradigms conducted in the ABCD study, we included data from each of the scanning modalities collected as predictors in MRI models. For all modalities except DMRI and RSFMRI, cortical data were parcellated using the Destrieux atlas (Destrieux, Fischl, Dale, & Halgren, 2010) and subcortical data were segmented using the FreeSurfer standard subcortical segmentation atlas (ASEG). For SMRI analyses, we used cortical thickness and surface area, as well as subcortical gray matter volume. DMRI data were divided into the white matter tracts described in (Hagler et al., 2009). From this, we used the DMRI measures of fractional anisotropy, mean diffusivity, transverse diffusivity, and fiber volume. For RSFMRI, we used functional connectivity between networks of the Gordon parcellation (Gordon et al., 2016) and between Gordon parcellation networks and subcortical structures (measured by the FreeSurfer ASEG segmentation). For each FMRI task, we used beta values from several contrasts per task.
FMRI Tasks
The tasks used in the current study have been described previously at length (Casey et al., 2018; Chaarani et al., 2020). In short, the EN-Back task was a modified version of a traditional N-Back task in which participants viewed a series of stimuli and for each responded if that stimulus matched the one they saw N items ago (i.e., “N back”). The task had two conditions: a 2-back as its active condition and a 0-back as the baseline condition. The stimuli for this task were images of faces of different emotional valences and images of places. The MID task included both anticipation and receipt of reward and loss. In this task, participants viewed an incentive cue for 2 seconds (anticipation) and then quickly responded to a target to win or avoid losing money ($5.00 or $0.20). Participants were then given feedback about their performance (receipt). The baseline used was “neutral” trials in which participants completed the same action but with no money available to be won or lost. The SST consisted of serial presentations of leftward and rightward facing arrows. Participants were instructed to indicate the direction of the arrows using a two-button response box (the “go” signal), except when the left or right arrow was followed by an arrow pointing upward (the “stop” signal). Participants were also instructed to respond as “quickly and accurately as possible”. Trials were then categorized based on the participant's accuracy (“correct” and “incorrect”).
Magnetic Resonance Imaging Acquisition and Processing
MRI scans were acquired at sites across the United States using 26 different scanners from two vendors (Siemens and General Electric); there were also 3 sites using Philips scanners that were excluded from analyses due to an error in processing prior to their release. MRI sequences are reported in Supplemental Methods and in prior work (Casey et al., 2018). All SMRI, FMRI, and DMRI data were preprocessed by the DAIRC using pipelines that have been detailed in prior work (Hagler et al., 2019). SMRI data were preprocessed using FreeSurfer version 5.3 (Hagler et al., 2019) to produce cortical thickness and cortical surface area measures for each of the 74 Destrieux atlas (Destrieux et al., 2010) regions of interest in each hemisphere (148 regions total) and gray matter volume for nine subcortical regions in FreeSurfer's “ASEG” parcellation in each hemisphere (18 regions total), plus the brainstem which was not split by hemisphere. AtlasTrack, a probabilistic atlas-based method for automated segmentation of white matter fiber tracts, was applied to the DMRI data to derive fractional anisotropy, mean diffusivity, transverse diffusivity, and fiber volume measures for 17 bilateral white matter tracts and 3 tracts that connect the brain's hemispheres (Hagler et al., 2009).
RSFMRI data were preprocessed by the DAIRC and this processing is detailed in (Hagler et al., 2019). RSFMRI data were registered to the first frame to account for head motion, corrected for spatial and intensity distortions, and co-registered with SMRI scans. Subsequent preprocessing steps specific to RSFMRI included removal of initial volumes, normalization and demeaning, censoring of volumes with > 0.2mm motion, band-pass filtering (between .009 and .08 Hz), and within-subject (1st level) regression to remove quadratic trends, motion, and mean time courses of cerebral white matter, ventricles, and whole brain (as well as their derivatives). Data were sampled onto the cortical surfaces and divided into the 422 cortical parcels that make up the 13 functionally-defined networks described in (Gordon et al., 2016). Correlations were calculated for the average timeseries of vertices in each ROI pair and these correlations were z-transformed. For within-network connectivity (i.e., coherence of networks), the average was taken of the z-transformed correlations of all ROIs in that network. For between-network connectivity (i.e., connectivity or anticorrelation of networks), the average was taken of z-transformed correlations between each ROI pair across the networks.
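The averaging step can be sketched as below. This is a simplified illustration of the within/between-network averaging described above, not the DAIRC pipeline itself; the function name and input layout are assumptions.

```python
import numpy as np

def network_connectivity(timeseries, labels, net_a, net_b):
    """Average Fisher z-transformed correlation between two networks.

    timeseries: (n_timepoints, n_rois) array of ROI-average signals.
    labels: network label for each ROI column. Pass net_a == net_b for
    within-network coherence; ROI self-pairs are excluded so that the
    infinite z-value on the diagonal never enters the average.
    """
    labels = np.asarray(labels)
    r = np.corrcoef(timeseries, rowvar=False)  # ROI-by-ROI correlations
    z = np.arctanh(r)                          # Fisher z-transform
    idx_a = np.where(labels == net_a)[0]
    idx_b = np.where(labels == net_b)[0]
    values = [z[i, j] for i in idx_a for j in idx_b if i != j]
    return float(np.mean(values))
```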
Task FMRI data were preprocessed using a multi-program pipeline similar to the RSFMRI pipeline. This processing yielded neural activation in these same cortical and subcortical regions for each FMRI contrast. The contrasts used for the SST were incorrect stop – correct go, correct stop – correct go, and correct go – Baseline; for the EN-Back the contrasts were 2-Back – 0-Back, Faces – Places, Negative Faces – Neutral Faces, Positive Faces – Neutral Faces, 0-Back – Baseline, and Places – Baseline; for the MID, contrasts were reward anticipation – neutral anticipation and positive reward outcome – negative reward outcome (i.e., win – loss). Beta values were extracted for each contrast from each of the 74 Destrieux atlas (Destrieux et al., 2010) regions of interest in each hemisphere (148 regions total), for nine subcortical regions in FreeSurfer's “ASEG” parcellation in each hemisphere (18 regions total), and for the brainstem, which was not split by hemisphere.
Data Analysis
Data Download & Processing
DRD, questionnaire/task, and MRI data were downloaded from the ABCD Data Repository on the National Institute of Mental Health Data Archive, using the RDS data file from ABCD release 2.0.1. Missing questionnaire/task and MRI data were mean imputed. Analyses were conducted in Python, using Brain Predictability Toolbox version 1.3.4 (Hahn et al., 2020). Analysis code is available on Github (https://github.com/owensmax/ABCD-DRD-Prediction).
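Column-wise mean imputation of the kind described can be sketched with scikit-learn; this is an illustrative example with made-up values, not the Brain Predictability Toolbox implementation used in the study.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with missing entries (np.nan)
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])

# Each missing value is replaced by the mean of its column
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)
# the column means (2.0 for column 0, 6.0 for column 1) fill the gaps
```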
Analytic Framework
The general analytic framework used was to predict DRD using the MRI and questionnaire/task features described above. We conducted model training and initial validation in a 5-fold cross-validation, then to guard against false positives we attempted to replicate successful predictions (i.e., those with R2 > 0.0%) on a lockbox test set that was not involved in initial model training. Our approach in choosing features was exploratory in that we attempted to use a variety of measures from across the ABCD study rather than selecting feature variables based on theory or prior results. In our primary analyses, we ran seven separate predictive modeling analyses with each of the six MRI modality data types as features (SMRI, RSFMRI, DMRI, and 3 FMRI tasks), as well as one analysis with all six MRI modalities used together as features. Quality control was conducted to ensure participant data were valid for all features used for a given predictive model, resulting in different sample sizes for each modality. Consequently, to provide a comparison unbiased by the different sample sizes of each MRI modality, we ran six separate predictive models with the questionnaire/task variables as predictors in the samples of each of the six MRI modalities. Additionally, to determine predictive accuracy of questionnaire/task variable models using the largest possible sample size we ran predictive modeling analyses using a seventh subset of all participants for the questionnaire/task variables only models. Then, for MRI modalities which had R2 > 0.0% in the 5-fold cross-validation and in the lockbox test set, we ran separate predictive models containing MRI and questionnaire/task features, to determine if the addition of MRI features improved performance of the models that used only questionnaire/task features. We considered this an important test to determine the incremental benefit of MRI over questionnaire/task data. 
A threshold of R2 > 0.0% was chosen to ensure no replicable effect, no matter how small, was missed. We repeated the entire process described above using four machine learning algorithms: 1) elastic net regression (Zou & Hastie, 2005), 2) random forest regression (Breiman, 2001), 3) light gradient boosting regression (Friedman, 2001), and 4) support vector regression with a radial basis function kernel (Boser, Guyon, & Vapnik, 1992). After finding similar results with all four models, we elected to focus on feature importance from the elastic net regression, because it both performed the best on questionnaire/task variable predictions and yielded the most interpretable coefficients of all approaches. Both elastic net regression and random forest regression are capable of accommodating large numbers of highly correlated variables (Breiman, 2001; Zou & Hastie, 2005), and empirical work has shown that elastic net, in particular, is highly effective at dealing with highly correlated features such as those found in the analysis of MRI and FMRI data (Jollans et al., 2019; Sirimongkolkasem & Drikvandi, 2019). However, to ensure that an overly large and intercorrelated feature space was not diminishing the effectiveness of our analyses, we also conducted a supplementary analysis attempting to use MRI features to predict DRD using recursive feature elimination, a wrapper algorithm designed to limit the features used in the model to only those having relevance to the target in the training set.
Cross-Validation
Initially, data were divided in an 80%/20% split, with the 80% used as a training set for model building and the 20% used as a final external/held-out test set (what has been called a “lockbox”). Then, in the training set, model training was conducted in a 5-fold cross-validation framework with 80% of the training data used for model training and 20% of the training data used as an independent, internal validation set on which each model was initially tested. Note that in this process, the validation set was not involved in model building and had not been seen by the model previously. The goal of this validation was to select models sufficiently promising to test in the lockbox test set. This cross-validation framework is illustrated in Figure 2. Within this 5-fold cross-validation, hyperparameter tuning was performed in the training set with a nested 3-fold cross validation and the combination of parameters with the highest average score was selected for the model tested on the internal validation set.
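The lockbox split plus nested cross-validation scheme can be sketched with scikit-learn on simulated data. Everything here is illustrative (data, hyperparameter grid, estimator settings); the study additionally grouped folds by family, which this minimal sketch omits.

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.linear_model import ElasticNet

# Simulated stand-ins for the feature matrix and DRD target
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 8))
y = X[:, 0] + 0.1 * rng.standard_normal(300)

# Step 1: hold out 20% as the lockbox test set, never used during training
X_train, X_lockbox, y_train, y_lockbox = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step 2: 5-fold CV on the training set; within each fold, hyperparameters
# are tuned in a nested 3-fold CV on that fold's training portion
outer = KFold(n_splits=5, shuffle=True, random_state=0)
fold_r2 = []
for fit_idx, val_idx in outer.split(X_train):
    search = GridSearchCV(ElasticNet(max_iter=5000),
                          {"alpha": [0.01, 0.1, 1.0]}, cv=3, scoring="r2")
    search.fit(X_train[fit_idx], y_train[fit_idx])
    fold_r2.append(search.score(X_train[val_idx], y_train[val_idx]))
mean_validation_r2 = float(np.mean(fold_r2))

# Step 3: refit on the full training set and test once on the lockbox set
lockbox_r2 = search.fit(X_train, y_train).score(X_lockbox, y_lockbox)
```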
We elected to evaluate an evolutionary search algorithm and a random hyperparameter search algorithm rather than using a more traditional grid search approach to hyperparameter tuning. This was done because of evidence that both evolutionary search algorithms and random search algorithms are equally or more effective at optimizing hyperparameters (Alibrahim & Ludwig, 2021; Bergstra & Bengio, 2012; Mantovani, Rossi, Vanschoren, Bischl, & De Carvalho, 2015). We specifically selected the Discrete One Plus One Evolutionary Search algorithm because it was recommended by the hyperparameter search package that we used, Nevergrad (https://facebookresearch.github.io/nevergrad/machinelearning.html). The One Plus One evolutionary search algorithm is a very simple evolutionary algorithm, useful for optimizing relatively simple search spaces (Droste, Jansen, & Wegener, 2002). This optimization strategy is similar to other evolutionary algorithms but is further simplified in that it only maintains a population of two hyperparameter combinations at any given time: a parent and a single offspring. The offspring, or new candidate hyperparameter combination, is generated by mutating the parent's hyperparameter values through a 'discrete' mutation. A hyperparameter is mutated simply by drawing a new value from a normal distribution, the 'discrete' requirement specifying that the new value be re-drawn until it is sufficiently different from the original parameter. If the offspring (mutated copy) performs better, it becomes the new parent; otherwise, it is discarded, and the process is repeated. Evolutionary search algorithms have been shown to be more effective in hyperparameter selection than grid search or Bayesian search methods (Alibrahim & Ludwig, 2021) and have been shown to produce numerous effective combinations of hyperparameters (Sipper, Fu, Ahuja, & Moore, 2018).
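The parent/offspring loop just described can be sketched as a minimal (1+1) search over a single continuous hyperparameter. This is an illustration of the idea, not Nevergrad's implementation; the function name, toy objective, and step size are hypothetical.

```python
import random

def one_plus_one_search(objective, init, sigma=0.5, iterations=60, seed=0):
    """Minimal (1+1) evolutionary search maximizing `objective`.

    Keeps a single parent; each iteration mutates it with a normal
    perturbation, re-drawing until the offspring differs appreciably
    (the 'discrete' requirement), and keeps the offspring only if it
    scores better than the parent.
    """
    rng = random.Random(seed)
    parent, parent_score = init, objective(init)
    for _ in range(iterations):
        child = parent + rng.gauss(0.0, sigma)
        while abs(child - parent) < 1e-3:   # re-draw until sufficiently different
            child = parent + rng.gauss(0.0, sigma)
        child_score = objective(child)
        if child_score > parent_score:      # offspring becomes the new parent
            parent, parent_score = child, child_score
    return parent, parent_score

# toy objective with a single peak at x = 2
best_x, best_score = one_plus_one_search(lambda x: -(x - 2.0) ** 2, init=0.0)
```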
Random search simply selects random combinations of hyperparameters and tests each; it has been shown to be as or more effective than a traditional grid search (Bergstra & Bengio, 2012; Mantovani et al., 2015). After running our primary analyses in the training/validation set with both the evolutionary search and random search methods and finding them comparable, we elected to focus the manuscript primarily on the results of the evolutionary algorithm and report the random search in the Supplementary Materials. Then, to ensure hyperparameter selection was not a limiting factor in predictive model accuracy, we conducted our primary analyses using multiple alternative hyperparameter selection approaches (described below), finding negligible differences in results.
Prediction accuracy was measured with R2. All analytic variations described (i.e., different algorithms, MRI modalities, quality control approaches) were conducted in the training set, and for each the average R2 across the five folds is reported for prediction of the validation set. Throughout these analyses, participants were grouped among cross-validation folds by family to ensure all siblings were in the same folds.
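Family-grouped folds of the kind described can be produced with scikit-learn's GroupKFold, which guarantees no group appears in both the training and validation portions of a fold. The data and family IDs below are simulated for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Simulated data: 10 hypothetical families with 2 children (siblings) each
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
y = rng.standard_normal(20)
family_id = np.repeat(np.arange(10), 2)

# Each fold keeps every family entirely on one side of the split
splits = list(GroupKFold(n_splits=5).split(X, y, groups=family_id))
for train_idx, val_idx in splits:
    # no family ID appears in both the training and validation indices
    assert set(family_id[train_idx]).isdisjoint(family_id[val_idx])
```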
Only a subset of the MRI feature predictive models was tested on the lockbox test set, namely those models that had a validation R2 > 0.0% using MRI data only. Additionally, the elastic net model using questionnaire/task variables with data from the whole sample was tested on the lockbox test set. Thus, only a limited number of models were tested on the lockbox test set, reducing the likelihood of false positives induced by testing numerous models for out-of-sample prediction on the test set. MRI modalities that were predictive of DRD in both the internal validation and lockbox test set were then input into MRI plus questionnaire/task feature models, which were tested in the same 5-fold cross-validation framework. We considered effective MRI models to be those in which MRI features plus questionnaire/task features performed better than questionnaire/task features alone, indicating incremental benefit of MRI data over questionnaire and out-of-scanner task data.
Follow-up Analyses (Alternative Brain Prediction Approaches)
After these primary analyses, we conducted several series of follow-up analyses to ensure that the performance of the MRI-based models was not a result of specific data processing or analytic decisions. Because this approach involved testing numerous models, we only considered models to be generalizable if they had R2 > 1% in the internal validation set (on average across 5 folds) and in the lockbox test set, and had a higher internal validation R2 for questionnaire/task plus MRI features than for the questionnaire/task-features-only model. The first supplementary analysis repeated the primary analyses using different thresholds of exclusion for DRD task performance, with exclusions of one, two, and four inconsistent points of indifference instead of the three used in our primary analyses. Next, we repeated our primary analysis using a calculated version of area under the DRD curve as the target variable, instead of mean point of indifference. Additionally, we repeated our primary analyses without excluding participants with poor quality MRI data, which allowed for more participants in brain analyses (albeit with poorer quality data included). As noted above, we also attempted to improve upon our primary analyses through feature selection, repeating our primary analyses using recursive feature elimination with ridge regression.
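Recursive feature elimination with a ridge estimator, as in the last follow-up analysis, can be sketched with scikit-learn: the estimator is fit repeatedly, and the features with the smallest coefficients are dropped each round until a target count remains. The data, feature counts, and regularization strength below are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import Ridge

# Simulated data where only the first two of 20 features carry signal
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(100)

# Drop the lowest-|coefficient| feature each iteration until 5 remain
selector = RFE(Ridge(alpha=1.0), n_features_to_select=5, step=1)
selector.fit(X, y)
kept = np.flatnonzero(selector.support_)   # indices of surviving features
```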
Additionally, as noted above, we tested alternative hyperparameter selection approaches to determine if hyperparameter selection was limiting the effectiveness of the models using MRI variables to predict DRD. While our primary analysis used the One Plus One evolutionary search algorithm with 60 iterations, we also tested the One Plus One evolutionary search algorithm with 100 and 200 iterations, random search with 60, 100, and 200 iterations, Hammersley Search Plus Middle Point with 200 iterations, and the Two Points Differential Evolutionary algorithm with 200 iterations. The Hammersley Search Plus Middle Point algorithm is a type of quasi-random search, which uses Hammersley random sampling to achieve more “regularity”; that is, it behaves more similarly to grid search in exploring the hyperparameter space while retaining the benefits of random search (Wong, Luk, & Heng, 1997). Differential Evolutionary algorithms iteratively attempt to improve candidate hyperparameter combinations by performing “crossover” with other candidate hyperparameters, resulting in novel combinations. Unlike other approaches, they use stochastic processes instead of gradient-based ones, which allows them to be effective in exploring high-dimensional search spaces (Knobloch, 2018; Storn & Price, 1997).
Furthermore, to ensure no confounding effects of site were altering our findings, we repeated our analyses grouping our cross-validation by family ID and site to ensure members of the same site and family were placed in the same folds. Finally, to confirm the appropriateness and sensitivity of our predictive modeling approach, we applied the same methods as our primary analyses to predict general intelligence (IQ) based on the NIH Toolbox Total Composite Score. Similar to other predictive modeling studies (e.g., Dubois, Galdi, Han, Paul, & Adolphs, 2018), we used these IQ predictions as a positive control for our primary analyses, given that it is established that IQ can be predicted from brain structure (Mihalik et al., 2019) and function (Dubois et al., 2018; Finn et al., 2015).
Results
Questionnaire/Task Predictive Models
Questionnaire/task-based predictive models showed similar performance across various subsets of the data (i.e., samples), as the only difference in these analyses was the sample size, with larger sample sizes tending to yield slightly more effective predictive models (see Table 2A; Figure 3). The four algorithms used yielded similar effectiveness in predicting DRD. Given this comparable performance across algorithms, we elected to use an elastic net questionnaire/task-based predictive model built on the entire training sample as our test of generalization to the held-out lockbox test set. Elastic net was chosen given its superior feature importance interpretability over the other three algorithms. The elastic net model built on the entire training set using questionnaire/task variables replicated in the lockbox test set (R2 = 4.2%), performance comparable to that of the best-performing fold of the 5-fold cross-validation (R2 = 4.9%).
Given the generalizability of the elastic net model, we report feature importance for the predictors in this model built on the entire training set (80% of data) and tested on the held-out lockbox test set (20% of data). To further clarify these regression coefficients, we also examined non-cross-validated, bivariate ordinary least squares regressions between the features in the elastic net model and mean point of indifference; from these we report the significance and univariate R2. The features, their elastic net regression coefficients, and the univariate p-value and R2 values are reported in Table 3. There were five features in the full sample elastic net model with coefficients over 1. The individual predictor with the largest coefficient was performance on the picture vocabulary task of the NIH toolbox, with lower performance being a predictor of greater DRD one year later. The predictor with the second largest coefficient was household income, specifically being in the lowest class of household income (parents earning less than $50,000 annually) was a predictor of higher levels of DRD. Another predictor of greater DRD was the child's parents not being married. Furthermore, high levels of screen time were predictors of higher DRD, with both child and parent reported weekday screen time being used by the model, as well as parent reported weekend screen time.
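The non-cross-validated bivariate check described above (simple OLS of the target on each single feature, reporting univariate R2 and significance) can be sketched with SciPy on simulated data; the feature, target, and effect size here are hypothetical stand-ins.

```python
import numpy as np
from scipy import stats

# Simulated stand-ins: one questionnaire feature and the DRD target
# (mean point of indifference), with a modest linear relationship
rng = np.random.default_rng(0)
feature = rng.standard_normal(500)
target = 0.3 * feature + rng.standard_normal(500)

# Bivariate OLS: slope, correlation, and significance for one feature
fit = stats.linregress(feature, target)
univariate_r2 = fit.rvalue ** 2   # variance in the target explained by this feature
p_value = fit.pvalue              # significance of the bivariate association
```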
Other factors that predicted greater DRD included child caffeine use, child positive urgency, delays in speech development early in life, having a small number of opposite sex friends, rule breaking symptoms on the Child Behavior Checklist (CBCL), withdrawn depressed symptoms on the CBCL, being male, and delays in meeting developmental milestones in early life. Features with comparable betas predicting lower DRD included having good grades in school, high performance on the NIH toolbox card sort task, high levels of prosocial behavior, and high levels of parental monitoring. Additionally, predictors of greater DRD with coefficients below .1 were late motor development in early life and having large numbers of same sex friends. Predictors of lower DRD with regression coefficients below .1 were the crystallized intelligence composite from the NIH toolbox and participation in individual sports.
MRI Predictive Models
Brain-based predictive model performance in the 5-fold cross-validation is displayed in Table 2B and in Figure 4. RSFMRI was the only modality which showed predictive accuracy above zero in three of the four algorithms. FMRI SST data showed accuracy above zero using two of the four algorithms. FMRI N-back data showed accuracy above zero using random forest only. No predictive model achieved an R2 of 1% or higher. For all MRI modality/algorithm combinations with R2 > 0.0% in MRI models in the training set, we built new models on the entire training set, which were tested in the lockbox test set. These MRI predictor models generally performed worse in predicting the test set than they had in the training data, with most models having R2 ≤ 0.0% in the lockbox test set. The only models to achieve R2 > 0.0% were elastic net and random forest regression using RSFMRI data.
MRI plus Questionnaire/Task Models
Next, to test if MRI-based features improved predictive accuracy for DRD beyond questionnaire/task factors, models were built with both MRI and questionnaire/task variables as predictors for algorithm/modality combinations whose “MRI only” models had R2 > 0.0% for predicting the internal validation set and the lockbox test set (i.e., elastic net and random forest regression models using RSFMRI data). When models using questionnaire/task and MRI data were built for each of these two modality/algorithm combinations using the same 5-fold cross-validation approach, MRI and questionnaire/task models performed worse at predicting DRD than models using questionnaire/task data alone (Table 2C). This suggests that MRI data did not improve the prediction of DRD beyond the questionnaire/task variables used and were, in fact, making these predictions worse. As such, we concluded that, in our primary analyses, MRI variables were not robust and unique predictors of DRD.
Alternative MRI Prediction Approaches
After failing to find effective predictive approaches using MRI variables, we conducted numerous alternative analyses, each altering a single aspect of our primary analyses. However, none of these alternative approaches yielded an effective brain-based model to predict DRD. When we set a threshold of 4 inconsistent points of indifference, results were comparable to our primary analyses, with one model performing above R2 ≥ 1% in the internal validation set, but this did not replicate in the lockbox test set (Supplemental Table 1). When we set a threshold of 2 or 1 inconsistent points of indifference, no models performed above R2 ≥ 1% in the internal validation set (Supplemental Table 2 and Supplemental Table 3). When we used an alternate measure of DRD as the target variable (area under the discounting curve, as opposed to mean indifference point), models were overall less effective than using our primary measure of DRD (Supplemental Table 4). None of our alternate hyperparameter search approaches yielded more effective models for predicting DRD with MRI data (Supplemental Table 5). When we repeated our primary analysis grouping participants by site in the cross-validation, one model was above chance in the internal validation, but it did not generalize to the lockbox test set (Supplemental Table 6). In analyses using recursive feature elimination, MRI model results were comparable to primary analyses, with no models having R2 ≥ 1% (Supplemental Table 7).
Prediction of Intelligence
To confirm the efficacy of the machine learning approach used, we applied the same predictive modeling methods as our primary analysis to predict the total cognitive composite of the NIH Toolbox (i.e., a proxy for IQ). Across MRI modalities, the most effective algorithms were elastic net and support vector regression. An elastic net regression model using the FMRI N-Back features was able to predict IQ with R2 = 15.5% in the 5-fold cross-validation and R2 = 15.1% in the lockbox test set. Elastic net regression models using SMRI features and DMRI features predicted IQ with R2 > 7% in the 5-fold cross-validation and achieved R2 = 8.5% and 6.3% in the lockbox test set, respectively. A support vector regression model with RSFMRI features predicted IQ with R2 = 4% in the 5-fold cross-validation and R2 = 5.1% in the lockbox test set. Elastic net regression with FMRI monetary incentive delay task features predicted IQ with R2 = 2.4% in the 5-fold cross-validation and 6.5% in the lockbox test set. Results for all models are shown in Supplemental Figure 1 and Supplemental Table 8.
Discussion
The current results suggest that there are a variety of questionnaire/task factors that may contribute to higher levels of DRD. Models using questionnaire/task variables as predictors were effective at predicting DRD using all algorithms, sample sizes, and approaches we examined. Most questionnaire/task feature models had R2 between 2% and 3% in the 5-fold cross-validation. The different machine learning algorithms performed comparably, suggesting algorithm choice was not a driving factor of results. The elastic net regression model using questionnaire/task features in the full sample performed well in predicting the lockbox test set, roughly approximating the best fold in the 5-fold cross-validation (R2 = 4.6%). Specific domains of importance in this model included demographic factors, cognitive function, school performance, developmental history, symptoms of psychopathology, personality, caffeine use, and participation in certain social and recreational activities. However, none of the six MRI paradigms examined yielded models that were effective at predicting elevated DRD, generalizable to the test set, and useful beyond questionnaire/task feature models. This was the case despite our efforts to test multiple machine learning algorithms and data processing approaches. The lack of effective MRI-based predictions of DRD comes in contrast to our control models, which were quite effective and generalizable at predicting IQ using only MRI-based features. These results indicate that questionnaire/task factors are more predictive of childhood elevated DRD one year in advance than MRI-measured brain factors. The current results do not speak directly to the reason why questionnaire/task measures are better predictors, but one possibility is that they are more similar to the DRD task, which was, itself, a questionnaire/task measure.
One clear finding was that cognitive ability was an important predictor of DRD. The most important single predictor in the elastic net regression model using questionnaire/task variables in the full sample was performance on the picture vocabulary test of the NIH Toolbox. The card sort task of the NIH toolbox was also an important predictor in the final elastic net model, and the NIH Toolbox crystallized intelligence composite was included in the model but was a relatively less important predictor. There is ample existing evidence that DRD is linked to intelligence, with a meta-analysis suggesting a moderately sized association (Shamosh & Gray, 2008). The picture vocabulary task is almost exclusively a measure of vocabulary and the task requires few other cognitive abilities. However, vocabulary tests are an effective index of education or overall intelligence and serve as an effective proxy for crystallized intelligence (Bright & van der Linde, 2020). Indeed, the crystallized intelligence composite score itself was included in the questionnaire/task model, which more clearly highlights the predictive value of general intelligence for DRD. The domain of cognition that has most frequently been linked to DRD is working memory (Bickel et al., 2011; Kurth-Nelson, Bickel, & Redish, 2012; Wesley & Bickel, 2014). Our modeling algorithms did not utilize the specific working memory task in the NIH toolbox (i.e., the list sort task). However, the card sort task is a general measure of executive function and cognitive flexibility (Zelazo, 2006), requiring planning, self-monitoring, and set-shifting in the pursuit of strategic goals. Since the card sort task measures a broad array of executive functions, its use by the model likely represents executive function generally as a predictor, which may explain why the model did not require the specific working memory task.
Another index of cognitive ability used in the model was children's school performance, which was among the more important predictors in the model. Parent-reported school performance was found to be a predictor of DRD, with “A's/excellent” school performance being a predictor of lower DRD. This finding is consistent with prior work showing that DRD is associated with school performance, even accounting for other relevant factors (Freeney & O'Connell, 2010; Lee et al., 2012).
Critical to the current predictive models of DRD beyond individual factors like intelligence were demographic and sociological factors. The second most important predictor of DRD was low household income, which was a predictor of elevated DRD. Furthermore, having married parents was the fourth most important predictor and sex was one of the relatively less important predictors. The use of low household income in the model was unsurprising, as socioeconomic hardship has been linked to DRD numerous times in the past (Farah et al., 2006; Green et al., 1996; Hampton et al., 2018; Reimers et al., 2009). This is likely due to a number of reasons, including the effects of stress in increasing allostatic load (Juster, McEwen, & Lupien, 2010) and as a rational adaptation to harsh and unpredictable environments (Frankenhuis, Panchanathan, & Nettle, 2016). While there is less pre-existing evidence of a relationship, parental marital status also has multiple possible pathways to association with DRD. For example, DRD might be linked to parent marital status through the association of parent marital status and socioeconomic status (Karney, 2021) or through the association of marriage with religiosity (Pew Research Center, 2014), which has been linked to lower rates of DRD (Weatherly & Plumm, 2012). A recent meta-analysis did not find sex differences in DRD among typically developing children, though it did find higher DRD in females with attention deficit/hyperactivity disorder than males with the disorder (Doidge et al., 2021). However, in the current study, being male was a predictor of higher DRD.
Other individual difference measures that were important predictors of DRD included impulsive personality traits, symptoms of psychopathology, and prosocial behavior. Positive urgency measured by the UPPS-P was one of the more important predictors of higher DRD. This is consistent with work showing DRD and impulsive traits to be associated, with positive urgency specifically having been weakly linked to DRD in previous reviews (MacKillop et al., 2016). One reason why positive urgency, specifically, may be the impulsive personality trait utilized by the current model is that this trait is the one best measured by the short form of the child-focused UPPS-P used in the current study (Watts et al., 2020). Thus, it is possible that inclusion of positive urgency in the predictive model reflects a more general effect of impulsive traits predicting DRD than a specific importance of this trait in its prediction. The two indices of psychopathology used in the final predictive model were the Rule-Breaking and Withdrawn/Depressed syndrome scales from the CBCL. The Rule-Breaking syndrome scale of the CBCL corresponds closely to diagnoses of conduct disorder (Yule et al., 2020) and children with conduct disorder have been shown to have higher DRD (Blair et al., 2020; White et al., 2014). Likewise, the Withdrawn/Depressed syndrome scale of the CBCL corresponds to diagnoses of major depressive disorder (Ebesutani et al., 2010) and DRD has been linked to depressive symptoms and major depression diagnosis (Imhoff, Harris, Weiser, & Reynolds, 2014; Pulcu et al., 2014). In addition, children's level of prosocial behavior was also a trait characteristic that predicted DRD, with more prosocial behavior predicting less DRD.
Another predictor of higher DRD was previous failure to meet important developmental milestones, specifically in domains of speech development and motor development. To our knowledge there is no previously established relationship between these developmental milestones and DRD. However, there is work showing DRD diminishes across childhood and adolescence (Steinberg et al., 2009) and it is possible that meeting these milestones late represents a general trend toward slower development that also impacts the development of DRD. Children's social network also served as an important predictor of DRD, as children with more same and opposite sex friends showed higher levels of DRD. This may be a proxy for extraversion, which was not measured in the current study, but has been linked to higher DRD in adults (Hirsh, Morisano, & Peterson, 2008). Furthermore, participation in individual sports (e.g., tennis) was a predictor of lower DRD and greater parental monitoring was also a predictor of lower DRD. Previous work has shown that parenting practices experienced in childhood can have an influence on adult DRD (Holmes et al., 2020) and it is possible that the inclusion of these variables as predictors is capturing similar processes.
One of the most interesting predictors of higher DRD was the amount of time children used screen technology such as television, video games, computers, and smartphones. A recent study measured young adults' smartphone use through self-report and actual usage data and found that more smartphone use was linked to higher DRD (Schulz van Endert & Mohr, 2020). Increasingly, over-use of technological devices is coming to be viewed through a lens of addictive behavior (though this framing is not without controversy, e.g., Bean, Nielsen, van Rooij, & Ferguson, 2017). For example, work has established the validity of measures of video game addiction (Lemmens, Valkenburg, & Peter, 2009), internet addiction (Thatcher & Goolam, 2005), and smartphone addiction (Kwon et al., 2013; Lin et al., 2014). Furthermore, internet gaming disorder is recognized by the ICD-11 (World Health Organization, 2020) and is among the recommended disorders for further study in the DSM-5 (American Psychiatric Association, 2013). As such, a plausible explanation for why screen time was selected as a predictor in the current study is that high levels of screen time may be an early emerging form of addictive behavior. However, an alternative hypothesis is that, in the current predictive model, screen time is serving as an additional index of socioeconomic status while the behavior of using screens is itself irrelevant to DRD. Children and adolescents from lower socioeconomic backgrounds have been shown to spend more time using screens, with these effects particularly true for girls (Carson, Spence, Cutumisu, & Cargill, 2010; Männikkö, Ruotsalainen, Miettunen, Marttila-Tornio, & Kääriäinen, 2020). Disentangling these possibilities is outside the scope of the current study, but this represents an important question going forward.
Also notable among predictors of greater DRD was higher caffeine use. There is some work in college students linking caffeinated energy drinks to higher DRD and alcohol use (Meredith, Sweeney, Johnson, Johnson, & Griffiths, 2016), though that context is rather different from the current study's (caffeine use in 10-year-old children). Interestingly, rodent model studies show that caffeine acutely reduces DRD, with DRD returning to baseline levels after caffeine is withdrawn (Diller, Saunders, & Anderson, 2008). This raises the possibility that caffeine use may be an attempt, intentional or not, at self-medication of impulsivity in participants with greater DRD. However, future work is needed to determine whether greater DRD reflects a biological effect of caffeine, an indicator of self-medication, or a secondary effect of caffeine's collinearity with other factors.
Taken together, the finding that these questionnaire/task factors were able to predict DRD one year later suggests that they may contribute to higher levels of DRD. However, further research is needed to determine whether this is the case. For one, though the predictions of DRD made by the questionnaire/task factors generalized out-of-sample at multiple levels of cross-validation (5-fold internal cross-validation and a held-out "lockbox" test set), the R2 values for these predictions were still quite small, suggesting that the majority of variance in DRD in the ABCD sample is explained by factors other than those included in the current analysis. As such, these findings should be interpreted as a "true but weak" prediction. Additionally, the current study could not determine whether these factors preceded the development of higher DRD, as DRD was not measured at baseline; consequently, we could not test whether our predictive models would predict change in DRD. Furthermore, the fact that these questionnaire/task factors predict higher future DRD does not necessarily indicate a causal relationship. It is possible that these factors and the higher DRD are both part of a pre-existing trajectory. However, correlation is a necessary precondition for a causal relationship, and these results generate initial hypotheses for future work on the causal factors in the development of excessive DRD. With these considerations in mind, the current results provide a meaningful initial investigation into which factors contribute to the elevations in DRD that characterize addiction and other psychological disorders.
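The validation scheme described above, a held-out "lockbox" test set of 20% plus five-fold cross-validation within the training portion, can be sketched as follows. The synthetic data, feature set, and elastic net settings here are illustrative stand-ins, not the study's actual pipeline.

```python
# Sketch of a lockbox + internal cross-validation evaluation scheme.
# Data and hyperparameters are illustrative, not the study's pipeline.
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split, cross_val_score

# Stand-in data: rows are participants, columns are questionnaire/task features.
X, y = make_regression(n_samples=500, n_features=40, noise=10.0, random_state=0)

# Reserve 20% as a lockbox test set, untouched during model development.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

model = ElasticNet(alpha=1.0, l1_ratio=0.5)

# Internal 5-fold cross-validation on the training set (R^2 per fold).
cv_r2 = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")
print("mean CV R^2:", cv_r2.mean())

# Final check: fit on the full training set, evaluate once on the lockbox set.
model.fit(X_train, y_train)
print("lockbox R^2:", model.score(X_test, y_test))
```

The key design point is that the lockbox set is scored exactly once, after all model development, so that generalization estimates are not inflated by repeated peeking.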
The MRI model results of the current study suggest that predicting late childhood DRD one year in the future using brain data is a challenging task that, if possible at all, would require a different strategy than the ones used in the current manuscript. We attempted numerous approaches to predicting DRD using brain data, with little success. We tested four machine learning algorithms drawn from three distinct algorithm classes (regularized linear regression, decision tree, and non-linear decision boundary). We also examined six different modalities of MRI data, which we modeled both separately and jointly. We tested multiple quality control procedures for the neuroimaging and DRD data and explored feature selection procedures for deciding which MRI features to include. There are, of course, an infinite number of possible machine learning approaches to a given prediction problem: one could test any number of algorithms, cross-validation strategies, hyperparameter tuning approaches, and feature selection methods, and those reported here represent only a subset of the possible combinations. However, the approaches we did test represent a broad variety of contemporary methods consistent with the current state of the literature. Furthermore, to confirm the general efficacy of our predictive approach, we demonstrated that it could effectively predict general intelligence (indexed by the NIH Toolbox Total Composite Score) with multiple MRI modalities as predictors, using all of the algorithms we selected. In short, we think the current results provide strong evidence that DRD is not easily predicted from MRI data collected one year earlier. One interpretation is that DRD may be more a product of an individual's childhood environment than of individual neurobiological characteristics. However, this remains conjectural until further investigated.
One consideration for this work is that it cannot rule out the possibility that other modeling approaches to predicting DRD from MRI data may be more successful. For example, one potentially more effective approach is using deep learning to extract information from minimally processed brain images. Given the computational intensity of such an approach, however, we felt this was beyond the scope of the current manuscript, which focuses on a diverse array of standard machine learning approaches that can be implemented on region-of-interest data using a common pipeline. Another approach that may prove more effective is to predict DRD from contemporaneously collected MRI data, which would represent an easier problem on which to determine the optimal modeling strategy than predicting DRD one year in advance, as was done in the current study. This could be pursued in future ABCD data releases in which DRD and MRI data are available at the same time point. However, the current results indicate a distinct possibility that DRD may be a fundamentally difficult construct to predict in children using current neurobiological tools.
An important consideration in interpreting these findings is the DRD measure used. While the specific DRD task has been validated in prior work (Koffarnus & Bickel, 2014) and DRD tasks generally have been shown to be effective in pre-adolescent children (Burns et al., 2020), there remain reasons for caution. Most work linking DRD to neurobiology has been conducted in adults or older adolescents, and while DRD measures generally show good test-retest reliability (Anokhin, Golosheykin, & Mulligan, 2015; Beck & Triplett, 2009), it is unknown whether the current task exhibits the same reliability in children. This can be addressed in future waves of the ABCD Study. Another reason for concern is the large number of participants who showed inconsistent performance on the task: most participants demonstrated at least one inconsistency in their indifference points. We used three inconsistencies as our primary exclusion criterion to balance concerns of sample size and data quality, although in supplemental analyses we repeated the analyses using 1, 2, and 4 inconsistencies as exclusion criteria, finding similar results. The large number of participants showing inconsistent data suggests the DRD measure may be noisier in this age group than a comparable measure in adults, which may have contributed to the difficulty of predicting DRD from brain data. Additionally, the participants excluded for poor performance differed from those retained on numerous measures, including steepness of DRD. This suggests some participants with very steep DRD were likely excluded, which may have dampened the effect sizes found here. Despite these concerns, the DRD measure is clearly capturing some individual differences, as it could be predicted by the questionnaire/task measures. It is possible that brain data may predict future DRD better in an older participant sample; fortunately, this question can be tested as the ABCD participants become adolescents and young adults.
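One way to operationalize the inconsistency count described above is to flag rises between consecutive indifference points across increasing delays. The specific rule sketched below (a rise exceeding 20% of the delayed amount, a common heuristic in the discounting literature) is an assumption for illustration, not the study's exact criterion.

```python
# Illustrative inconsistency check for delay discounting data.
# Assumes indifference points (as proportions of the delayed reward)
# should decline as delay grows; counts rises exceeding a tolerance.

def count_inconsistencies(indifference_points, tolerance=0.2):
    """Count rises between consecutive indifference points that
    exceed `tolerance` (a proportion of the delayed reward)."""
    violations = 0
    for earlier, later in zip(indifference_points, indifference_points[1:]):
        if later - earlier > tolerance:
            violations += 1
    return violations

# Indifference points across increasing delays (hypothetical participants).
clean = [0.95, 0.80, 0.60, 0.45, 0.30, 0.20, 0.10]
noisy = [0.95, 0.40, 0.70, 0.30, 0.60, 0.20, 0.55]

print(count_inconsistencies(clean))  # 0
print(count_inconsistencies(noisy))  # 3 -> excluded under a 3+ criterion
```

Under a rule like this, the `clean` participant would be retained while the `noisy` participant would meet the primary exclusion threshold of three inconsistencies.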
Another notable feature of the current results was the small effect sizes found. The questionnaire/task model in the full sample had an R2 = 3.0% in the five-fold cross-validation, which is equivalent to a Pearson's correlation of .17. According to Cohen's heuristics, this would be a small effect (Cohen, 1988). However, there is increasing research suggesting that Cohen's heuristics are overly optimistic about the size of effects researchers in psychology and cognitive neuroscience should expect, and that effects of size r = .1-.2 are typical of effects identified in large samples in this field (Marek et al., 2020; Owens et al., 2021; Schäfer & Schwarz, 2019). There is also reason to believe that small effect sizes can still be meaningful, given their ability to accumulate over time and space (Abelson, 1985; Funder & Ozer, 2019). Thus, we do not consider this a weakness of the current study, but rather a finding to be expected given the effect sizes observed in the ABCD Study and other large datasets.
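As a quick arithmetic check, assuming the reported R2 reflects the squared correlation between predicted and observed scores, variance explained converts to a correlation by taking the square root:

```python
# Convert variance explained (R^2) to the implied Pearson correlation.
import math

r_squared = 0.030           # 3.0% of variance explained
r = math.sqrt(r_squared)    # implied correlation: sqrt(0.03)
print(round(r, 2))          # 0.17
```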
Another factor worth noting is the heterogeneity of pubertal status in the current sample. In the full sample (N = 4042), 35% of participants were pre-pubertal, 36% early-pubertal, 27% mid-pubertal, 2% late-pubertal, and 0.2% post-pubertal. We think that a more homogeneous sample of equal size may have yielded more accurate predictive models. However, we do not consider this a critical issue, given that most participants were pre-pubertal or early in puberty and that the puberty scale was not a predictor of DRD in our questionnaire/task elastic net model. Further, we think the benefit of the large sample afforded by the ABCD Study is worth the cost of this heterogeneity, given the clear link between sample size and model performance in our analyses of questionnaire/task factors, which consistently showed better predictive performance at larger sample sizes.
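The sample-size analysis mentioned above can be sketched as a learning curve, i.e., cross-validated performance evaluated at increasing training-set sizes. The synthetic data and model below are stand-ins for the study's actual features and pipeline.

```python
# Sketch of a learning-curve analysis: cross-validated R^2 at a range
# of training-set sizes. Data and model are illustrative stand-ins.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import learning_curve

X, y = make_regression(n_samples=600, n_features=30, noise=15.0, random_state=1)

sizes, train_scores, val_scores = learning_curve(
    ElasticNet(alpha=1.0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="r2")

# Mean validation R^2 at each training-set size.
for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"n={n}: mean CV R^2 = {score:.2f}")
```

If out-of-sample R^2 rises with training-set size, as in the current study's questionnaire/task analyses, larger samples are buying genuine predictive signal rather than overfitting.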
In conclusion, the current study identified an array of questionnaire/task indicators that made generalizable out-of-sample predictions of DRD one year in advance. Key features in predicting DRD included cognitive function, demographic factors, personality traits, symptoms of psychopathology, developmental history, social interactions with parents and friends, and specific recreational behaviors (e.g., screen media activity, sports participation). However, predictive models using brain structure and function measured by MRI were ineffective at predicting DRD. These findings point to a variety of questionnaire/task factors that may serve as mechanisms in the development of excessive DRD and that merit further investigation.
Disclosures and Acknowledgements
J.M. is a Principal and Senior Scientist at BEAM Diagnostics, Inc. No other authors have conflicts of interest to report. This work was funded by NIH/NIDA T32DA043593. Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive Development (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood. The ABCD Study is supported by the National Institutes of Health and additional federal partners under award numbers U01DA041022, U01DA041028, U01DA041048, U01DA041089, U01DA041106, U01DA041117, U01DA041120, U01DA041134, U01DA041148, U01DA041156, U01DA041174, U24DA041123, U24DA041147, U01DA041093, and U01DA041025. A full list of supporters is available at https://abcdstudy.org/federal-partners.html. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/scientists/workgroups/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in analysis or writing of this report. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or ABCD consortium investigators. The ABCD data repository grows and changes over time. The ABCD data used in this report came from version 2.0.1. Additionally, the authors would like to thank Kyla Belisario for assistance with data processing. This work was previously presented at the 2021 Organization for Human Brain Mapping annual convention as a poster.