13  Gathering and Cleaning Data!

Information for the Fiscal Futures all funds database comes from detailed information provided by the Illinois Comptroller’s office.  Currently the database includes 300,000 individual records for fiscal years 1998 to 2022.

Normally, when your receive the new fiscal year files from the Comptrollers office, you will need to change the variable names so that they are consistent with past years. This is an example of reading in the new file and changing the variable names. They seem to change almost every year in the file received from the FOIA so if the code breaks here, check to make sure that the columns you are trying to rename exist and are spelled correctly! Once variables are the same, you will want to save the file as a csv file in its Revenue/Expenditure file and bind all past years and the current year together in one dataframe.

#Example code below: Read in excel file and rename columns so that it plays well with the other years' files.

revenue_fy23 <- read_xlsx("Fis_Fut_Rev_2023_Final.xlsx") %>% 
  rename(fy = 'FY',
         fund = 'FUND',
         fund_name = 'FUND NAME',
         agency = 'AGENCY',
         agency_name = 'AGENCY NAME',
         source = 'REVENUE SOURCE',
         source_name = 'REV SRC NAME',
         receipts = 'AMOUNT'
  ) 

exp_fy23 <- read_xlsx("Fis_Fut_Exp_2023_Final.xlsx") %>% 
  rename(fy = 'FY',
         fund = 'FUND',
         fund_name = 'FUND NAME',
         agency = 'AGENCY',
         agency_name = 'AGENCY NAME',
         appr_org = 'ORGANIZATION',
         org_name = 'ORGANIZATION NAME',
         obj_seq_type = 'APPROPRIATION',
         wh_approp_name = 'APPROPRIATION NAME',
        # exp_net_xfer = 'NET OF TRANS AMOUNT',
         expenditure = 'EXPENDED'

  ) 

# %>%
#   # these come from ioc_source file after merging
#   mutate(data_source = "exp IOC Aug 2022",
#          object = ,
#          seq = ,
#          type = ,
#          fund_cat = FIND_COLUMN, #create fund_cat column
#          fund_cat_name = FIND_NAME) # create fund_cat_name column

The code chunk below takes the .dta files for all fiscal years before FY 2022 and binds them together. Variable names were manually changed by past researchers so that they were consistent across years.

For FY 2023 and after, .dta files can be avoided entirely and .csv or excel files will be used. All files before this year had been saved and passed on as .dta files for Stata code before the transition to R in Fall 2022.

For years after fy22: add line of code to bind csv files after binding the dta files together. Variable names must be identical to merge files together.

Code below reads in the csv files created in chunks above (allrevfiles.csv and allrexpfiles.csv). These files contain all years of data combined into one file BEFORE any recoding is done. Do not use this file for summing categories because it is just an inbetween step before recoding revenue and expenditure categories. It also does not contain changes in funds or agencies.

Code
# combined in past chunks called create-rev-csv and create-exp-csv

allrevfiles23 <- read_csv("./data/allrevfiles_Nov132023.csv") #combined but not recoded

allexpfiles23 <- read_csv("./data/allexpfiles_Nov132023.csv") #combined but not recoded

13.0.1 Inspecting new FY23 data files

Data files for closed years have been obtained from IOC. The numbers of funds, agencies, organizations, and revenue sources below were found by using pivot tables in the codebook files in the FY2022 Box folder. This could also be done using R and grouping new files by fund, agency, source number, source names, etc..

Revenue File:

  • 684 Fund Numbers
  • 80 Agencies
  • 1184 Revenue source numbers
  • 1156 revenue source names

Expenditure File:

  • 708 Fund Numbers
  • 107 Agencies
  • 98 Organization Numbers
  • 313 Organization names

13.0.2 Finding new agencies and funds

General steps:

  1. Identify new and reused funds for newest fiscal year.
  2. Recode funds to take into account different fund numbers/names over the years. See [Recoding New and Reused Funds] for code chunk that does this.
  3. Update fund_ab_in_CURRENTFY.xlsx with any changes from previous fiscal year.

New Agencies, Funds, and Organizations from Expenditure files:

  • Using the code below, I found 2 agencies, multiple funds, and a couple organizations that had not been specifically mentioned in the funds_ab_in file - AWM, FY2022.
  • Note: One new org name and org number combo has an expenditure of $600 million for pensions payment.
Code
#allrevfiles23 <- read_csv("allrevfiles23.csv") #combined but not recoded
#allexpfiles23 <- read_csv("allexpfiles23.csv") #combined but not recoded



cur_fy  <- 2023  # Create variable for current fiscal year to save time for updating in future years.


#### From Expenditure Data #####

# agencies referenced in any year before 2023:
agencies_past <- allexpfiles23 %>% 
  filter(fy < cur_fy) %>% 
  mutate(agency == as.character(agency)) %>% 
  group_by(agency, agency_name) %>% unique() %>% 
  summarize(expenditure = sum(expenditure, na.rm = TRUE)) %>% 
  drop_na() %>% 
  arrange(agency)
# agencies_past # 148 agencies ever


# agencies in 2023 data:
agencies23 <- allexpfiles23  %>% 
  filter(fy == cur_fy) %>% 
  mutate(agency == as.character(agency)) %>% 
  group_by(agency, agency_name) %>% 
  summarize(expenditure = sum(expenditure, na.rm = TRUE))


# 280 and 533 are new agency codes from old code cleaning methodology.
# When creating fiscal future categories for the fiscal gap, 280 is assigned to Group = 920 ( Judicial) expenditure category
# 533 was assigned to "949. Other Boards and Commissions"

# no new agencies from fy22 to fy23: 
anti_join(agencies23, agencies_past, 
          by = c("agency", "agency_name")) %>% 
  arrange(agency)




funds_past <- allexpfiles23  %>% 
  filter(fy < cur_fy) %>% 
  mutate(fund == as.character(fund)) %>% 
  group_by(fund, fund_name) %>% 
  summarize(count = n(), Expenditure = sum(expenditure, na.rm = TRUE))  %>% 
  drop_na()

funds23 <- allexpfiles23  %>% 
  filter(fy == cur_fy) %>% 
  mutate(fund == as.character(fund)) %>% 
  group_by(fund, fund_name) %>%  
  summarize(count = n(), Expenditure = sum(expenditure, na.rm = TRUE)) %>% 
  unique()


# 26 funds were in FY23 data that were not in past data:
anti_join(funds23, funds_past, 
          by = c("fund", "fund_name")) %>% 
  arrange(fund)
# Each year these must examined closely to determine if they are a new fund, a reused fund, or if there is just a slight difference in the name of the fund 


# orgs_past for orgs in the past =  920 org groups ever
orgs_past <- allexpfiles23  %>% 
  filter(fy < cur_fy) %>% 
  mutate(appr_org == as.character(appr_org)) %>% 
  group_by(appr_org, org_name) %>% unique() %>% 
  summarize(Expenditure = sum(expenditure, na.rm = TRUE)) %>% 
  drop_na()

# orgs_past # 916 org groups ever
orgs23 <- allexpfiles23  %>% 
  filter(fy == cur_fy) %>% 
  mutate(appr_org = as.character(appr_org)) %>% 
  group_by(appr_org, org_name) %>% 
  summarize(Expenditure = sum(expenditure, na.rm = TRUE))
# orgs22 # 399 org groups this year


# 3 org number and org name combos are new for FY2023:
# Usually not a big deal, orgs are not used much in the gap calculation.
# but good to stay aware and look for any change that may be important.
anti_join(orgs23, orgs_past,
          by = c("appr_org", "org_name")) %>% 
  arrange(appr_org)

Largest change for FY23 was the $184 million Grocery Tax Replacement fund and $20 million for fund 0156.

New Revenue Funds, Sources, and New Agencies:

Code
#### From Revenue Data ####


# agencies_past # 108 agencies ever
agencies_past <- allrevfiles23  %>% filter(fy < cur_fy) %>% 
  mutate(agency == as.character(agency)) %>% 
  group_by(agency, agency_name) %>% unique() %>% 
  summarize(Receipts = sum(receipts, na.rm = TRUE)) %>% 
  drop_na()


agencies23 <- allrevfiles23  %>% 
  filter(fy == cur_fy) %>% 
  mutate(agency == as.character(agency)) %>% 
  group_by(agency, agency_name) %>% 
  summarize(Receipts = sum(receipts, na.rm = TRUE))


# Finds 533 and 280 again. 
anti_join(agencies23, agencies_past, by = c("agency", "agency_name")) %>% 
  arrange(agency)


funds_past <- allrevfiles23  %>% 
  filter(fy < cur_fy) %>% 
  mutate(fund == as.character(fund)) %>% 
  group_by(fund, fund_name) %>% 
  summarize(count = n(), Receipts = sum(receipts, na.rm = TRUE))  %>% 
  drop_na()

funds23 <- allrevfiles23  %>% 
  filter(fy == cur_fy) %>% 
  mutate(fund == as.character(fund)) %>% 
  group_by(fund, fund_name) %>%  
  summarize(count = n(), Receipts = sum(receipts, na.rm = TRUE)) %>% 
  unique() %>% 
  drop_na()

# 13 revenue funds were in FY23 revenue data that were not in past data 
# some could be small fund name changes
anti_join(funds23, funds_past, by = c("fund", "fund_name")) %>% 
  arrange(fund)



sources_past <- allrevfiles23  %>% 
  filter(fy < cur_fy) %>% 
  mutate(source == as.character(source)) %>% 
  group_by(source, source_name) %>% 
  summarize(count = n(), Receipts = sum(receipts, na.rm = TRUE))  %>% 
  drop_na()

sources23 <- allrevfiles23  %>% 
  filter(fy == cur_fy) %>% 
  mutate(source == as.character(source)) %>% 
  group_by(source, source_name) %>% 
  summarize(count = n(), Receipts = sum(receipts, na.rm = TRUE)) %>% 
  unique()


# 16 revenue sources were in FY23 data that were not in past data 
# some could be small source name changes:
anti_join(sources23, sources_past, by = c("source", "source_name")) %>% 
  arrange(source)

13.0.3 Recoding New and Reused Funds

Warning

Remember: allrevfiles and allexpfiles contain the data received from the comptroller without recoding agencies or funds that have changed between fiscal years. Variables were renamed when needed to have consistent names but funds and agencies have NOT been recoded for consistency in these files.

New or reused funds were manually added to the funds_ab_in.xlsx file and determined if they should or should not be included in Fiscal Future calculations.

For funds that were reused once, a 9 replaces the 0 as the first digit. If reused twice, then the first two values are 10.

  • Ex. 0350 –> 9350 because its use changed.
  • Ex. 0367 becomes 10367 because its use has changed twice now. There was fund 0367 originally, then its use changed and it was recoded as 9367, and now it changed again so it is a 10367.
  • Excel file also has alternative ways to name funds (e.g. 0397-A and 0397-B) and variables for the year that the fund stopped being used.

New or reused funds revenue file recoding:

# if first character is a 0, replace with a 9 if its purpose has changed

rev_1998_2023 <- allrevfiles23 %>%

      mutate(fund = ifelse(fy < 2002 & fund %in% c("0730", "0241", "0350", "0367", "0381", "0382", "0526", "0603", "0734", "0913", "0379"), str_replace(fund, "0","9"), fund)) %>%
  
  mutate(fund = ifelse (fy < 2008 & fund %in% c("0027", "0033", "0037", "0058", "0062", "0066", "0075", "0083", "0116", "0119", "0120", "0122", "0148", "0149", "0157", "0158", "0166", "0194", "0201", "0209", "0211", "0217", "0223", "0231", "0234", "0253", "0320", "0503", "0505", "0512", "0516", "0531", "0532", "0533", "0547", "0563", "0579", "0591", "0606", "0616", "0624", "0659", "0662", "0665", "0676", "0710", 

"0068", "0076", "0115", "0119", "0168", "0182", "0199", "0241", "0307", "0506", "0509", "0513"), str_replace(fund, "0","9"), fund)) %>%

  mutate(fund = ifelse(fy < 2016 & fund %in% c("0263", "0399", "0409"), str_replace(fund, "0","9"), fund)) %>%

  mutate(fund =  ifelse(fy < 2017 & fund == "0364", str_replace(fund, "0","9"), fund)) %>%

  mutate(fund =  ifelse(fy < 2018 & fund %in% c("0818", "0767", "0671", "0593", "0578"), str_replace(fund, "0","9"), fund)) %>%


  mutate(fund = ifelse(fy>1999 & fy < 2018 & fund == "0231", "10231", fund) ) %>%
  

  mutate(fund = ifelse(fy < 2019 & fund %in% c("0161", "0489", "0500", "0612", "0893", "0766"), str_replace(fund, "0","9"), fund)) %>%

  mutate(fund =  ifelse(fy < 2020 & fund %in% c("0254", "0304", "0324", "0610", "0887", "0908", "0939", "0968"), str_replace(fund, "0","9"), fund)) %>%

  mutate(fund =  ifelse(fy < 2021 & fund %in% c("0255", "0325", "0348", "0967", "0972"), str_replace(fund, "0","9"), fund) ) %>%
  
   #2022 changes

  mutate(fund = ifelse(fy < 2022 & fund %in% c("0110","0165","0351", "0392", "0393", "0422", "0544", "0628", "0634",  "0656", "0672", "0683", "0723", "0742", "0743"), str_replace(fund, "0","9"), as.character(fund))) %>%  # replaces first 0 it finds with a 9

  mutate(fund = ifelse(fy < 2022 & fund == "0367", "10367", as.character(fund)) # fund reused for 3rd time

) %>%

  # 2023 fund changes
  mutate(fund = ifelse(fy < 2023 & fund %in% c("0099","0210","0246", "0250", "0264", "0268", "0300", "0311", "0403", "0448","0645", "0727", "0729", "0791"), str_replace(fund,"0", "9"), as.character(fund))) %>%
  mutate(fund = ifelse(fy < 2023 & fund == "0734" , "10734", as.character(fund)),
         fund = ifelse(fy<2023 & fund == "0820", "10820", as.character(fund) ) # fund reused for 3rd time
  ) %>%

 #  other fund changes noticed on April 1, 2024
  mutate(fund = ifelse(fy < 2023 & fund %in% c("0490", "0684", "0747", "0869"), str_replace(fund,"0", "9"), as.character(fund))) 

Expenditure recoding:

Code
# if first character is a 0, replace with a 9

exp_1998_2023 <- allexpfiles23 %>%

      mutate(fund = ifelse(fy < 2002 & fund %in% c("0730", "0241", "0350", "0367", "0381", "0382", "0526", "0603", "0734", "0913", "0379"), str_replace(fund, "0","9"), fund)) %>%

  mutate(fund = ifelse(fy < 2008 & fund %in% c("0027", "0033", "0037", "0058", "0062", "0066", "0075", "0083", "0116", "0119", "0120", "0122", "0148", "0149", "0157", "0158", "0166", "0194", "0201", "0209", "0211", "0217", "0223", "0231", "0234", "0253", "0320", "0503", "0505", "0512", "0516", "0531", "0532", "0533", "0547", "0563", "0579", "0591", "0606", "0616", "0624", "0659", "0662", "0665", "0676", "0710", 

"0068", "0076", "0115", "0119", "0168", "0182", "0199", "0241", "0307", "0506", "0509", "0513"), str_replace(fund, "0","9"), fund)) %>%
  
  mutate(fund = ifelse(fy < 2016 & fund %in% c("0263", "0399", "0409"), str_replace(fund, "0","9"), fund)) %>%

  mutate(fund =  ifelse(fy < 2017 & fund == "0364", str_replace(fund, "0","9"), fund)) %>%

  mutate(fund =  ifelse(fy < 2018 & fund %in% c("0818", "0767", "0671", "0593", "0578"), str_replace(fund, "0","9"), fund)) %>%

  mutate(fund = ifelse(fy>1999 & fy < 2018 & fund == "0231", "10231", fund) ) %>%
  
  mutate(fund = ifelse(fy < 2019 & fund %in% c("0161", "0489", "0500", "0612", "0893", "0766"), str_replace(fund, "0","9"), fund)) %>%

  mutate(fund =  ifelse(fy < 2020 & fund %in% c("0254", "0304", "0324", "0610", "0887", "0908", "0939", "0968"), str_replace(fund, "0","9"), fund)) %>%

  mutate(fund =  ifelse(fy < 2021 & fund %in% c("0255", "0325", "0348", "0967", "0972"), str_replace(fund, "0","9"), fund))  %>%
  
  #2022 changes
  mutate(fund = ifelse(fy < 2022 & fund %in% c("0110","0165","0351", "0392", "0393", "0422", "0544", "0628", "0634",  "0656", "0672", "0683","0723", "0742", "0743"), str_replace(fund, "0","9"), as.character(fund))) %>%  # replaces first 0 it finds with a 9

  mutate(fund = ifelse(fy < 2022 & fund == "0367", "10367", as.character(fund)) # fund reused for 3rd time 

  ) %>%

  # 2023 fund changes
  mutate(fund = ifelse(fy < 2023 & fund %in% c("0099","0210","0246", "0250", "0264", "0268", "0300", "0311", "0403", "0448","0645", "0727", "0729", "0791"), str_replace(fund,"0", "9"), as.character(fund))) %>%
  mutate(fund = ifelse(fy < 2023 & fund == "0734" , "10734", as.character(fund)),
         fund = ifelse(fy<2023 & fund == "0820", "10820", as.character(fund) ) # fund reused for 3rd time
  ) %>%

 #  other fund changes noticed on April 1, 2024
  mutate(fund = ifelse(fy < 2023 & fund %in% c("0490", "0684", "0747", "0869"), str_replace(fund,"0", "9"), as.character(fund))) 
Tip

The funds_ab_in.xlsx file contains the following information: all funds that have existed since 1998, if they still exist, indicates if fund numbers have been reused for varying purposes, and is updated yearly with new fund numbers used by the IOC.

Code
agencies_exp <- exp_1998_2023 %>% 
  distinct(agency, agency_name) %>% 
  filter(!is.na(agency_name)) %>%
  mutate(agency = as.character(agency))

funds_exp <- exp_1998_2023 %>% 
  distinct(fund, fund_name) %>% 
  filter(!is.na(fund_name))

funds_ab_in_2023 = readxl::read_excel("data/funds_ab_in_2023.xlsx")


exp_temp <- exp_1998_2023 %>% 
  select(-fund_name) %>%
  arrange(fund, fy)  %>%
# join  funds_ab_in_2023  to exp_temp
  left_join(funds_ab_in_2023, by = "fund")



# Agency == 799 for Statutory transfers 
#  Object == 1993 is for Interfund cash transfers  
exp_temp <- exp_temp %>% 
  mutate(transfer = ifelse(org_name == "TRANSFERS", 1, 0),
         trans_agency = ifelse(org_name == "TRANSFERS", 
                               str_sub(obj_seq_type,1,3), NA),
         trans_type = ifelse(org_name == "TRANSFERS", 
                             str_sub(obj_seq_type, 4,9), NA)) %>%   
  
  mutate(
         object = ifelse(transfer == 0, as.character(str_sub(obj_seq_type, 1, 4)), NA_character_),  # appropriation object
         sequence = ifelse(transfer == 0, as.character(str_sub(obj_seq_type, 5,6)), NA_character_),
         type = ifelse(transfer == 0, str_sub(obj_seq_type, 7,8), NA_character_)    # appropriation type
         )
  • the initial combined and years of data are saved as dataframes named exp_1998_2023 and rev_1998_2023. These are then saved as exp_temp and rev_temp while recoding variables. This is BEFORE creating Fiscal Future category groups and further data cleaning done in the Calculating the Fiscal Gap page.

Update Agencies: Some agencies have merged with others or changed names over time.

[[ TO DO: Add appendix item of Current agencies (with indented past agencies) from the Methodology Document. ]]

Code
# recodes old agency numbers to consistent agency number
exp_temp <- exp_temp %>% 

  mutate(agency = case_when(

    (agency=="438"| agency=="475" |agency == "505") ~ "440",
    # financial institution &  professional regulation &
     # banks and real estate  --> coded as  financial and professional reg

    agency == "473" ~ "588", # nuclear safety moved into IEMA

    (agency =="531" | agency =="577") ~ "532", # coded as EPA

    (agency =="556" | agency == "538") ~ "406", # coded as agriculture

    agency == "560" ~ "592", # IL finance authority (fire trucks and agriculture stuff)to state fire marshal

    agency == "570" & fund == "0011" ~ "494",   # city of Chicago road fund to transportation
    TRUE ~ (as.character(agency)))) 

# 
# exp_temp <- exp_temp %>% filter(!is.na(fy))
# ## Fill in missing names for funds
# exp_temp <- exp_temp  %>%
#   select(-c(agency_name)) %>%
#   left_join(funds_exp) %>%
#   left_join(agencies_exp)

For aggregating revenue, use the rev_1998_2023 file, join the funds_ab_in_2023 file to it, and then join the ioc_source_type file. Remember: You need to update the funds_ab_in and ioc_source_type file every year!

rev_temp <- inner_join(rev_1998_2023, funds_ab_in_2023, by = "fund") %>% 
  arrange(source) 

funds_rev <- rev_temp %>% distinct(fund, fund_name_ab) %>% filter(!is.na(fund_name_ab))

agencies_rev <- rev_temp %>% distinct(agency, agency_name) %>% filter(!is.na(agency_name)) %>%
  mutate(agency = as.character(agency))



# need to update the ioc_source_type file every year! 
ioc_source_type <- readxl::read_xlsx("./data/ioc_source_updated23_AWM.xlsx")

rev_temp <- left_join(rev_temp, ioc_source_type, by = "source")

sources_rev <- rev_temp %>% distinct(source, source_name_AWM) %>% filter(!is.na(source_name_AWM))


# fund info to revenue for all years
rev_temp <- rev_temp %>% 
  select(-c(fund_name, agency_name, 
            source_name_AWM))  %>% 
  left_join(funds_rev) %>%
  left_join(sources_rev)





# automatically used source, source name does not match for the join to work using source_name

# recodes old agency numbers to consistent agency number
rev_temp <- rev_temp %>% 

  mutate(agency = case_when(
    (agency=="438"| agency=="475" |agency == "505") ~ "440", # financial institution &  professional regulation &
     # banks and real estate  --> coded as  financial and professional reg

    agency == "473" ~ "588", # nuclear safety moved into IEMA
    (agency =="531" | agency =="577") ~ "532", # coded as EPA
    (agency =="556" | agency == "538") ~ "406", # coded as agriculture
    agency == "560" ~ "592", # IL finance authority (fire trucks and agriculture stuff)to state fire marshal
    agency == "570" & fund == "0011" ~ "494",   # city of Chicago road fund to transportation

    TRUE ~ (as.character(agency)))) %>%
  #left_join(sources_rev) %>%
  left_join(agencies_rev) # add correct agency names back in

13.0.4 Export Intermediate Files

Code
write_csv(rev_temp, file = "./data/rev_temp.csv")

write_csv(exp_temp, file = "./data/exp_temp.csv")
Tip

The exp_temp.csv file and rev_temp.csv file are most likely the data that non-Fiscal Future researchers would want to use for their own analyses. These files should have consistent agencies and funds tracked from 1998 to the present.