Here at the Connecticut Data Collaborative, we are no strangers to all types of data. We pride ourselves on the effort it takes to transform raw tabular data from disparate sources into the tidy format you see on

I'd like to call out one data set in particular that requires a little extra attention; explain some of the things we see in the raw data, and share some of my data cleaning steps. The data set I will be talking about is the Accidental Drug Related Deaths by Drug Type data set, sourced from the Connecticut Office of the Chief Medical Examiner, accessed via

Accidental Drug Related Deaths by Drug Type reports totals and subtotals of deaths attributable to accidental drug overdoses by place of death as reported by the Connecticut Office of the Chief Medical Examiner. Deaths are grouped by age, race, ethnicity, and gender and by the types of drugs detected post-death.

This data set is aggregated by drug type, based on the presence of opioids. Opioids are a class of drug that include heroin, fentanyl, and pain-relievers such as oxycodone and morphine.* The Opioid Crisis has been heavily discussed all across the country, including here in Connecticut. Being able to identify hyper-local areas in Connecticut where the epidemic has had it's greatest impact helps communities determine where services are needed the most. This data set helps to identify those areas, making the data preparation and cleaning steps that much more important.  

To begin, I'd like to explain why we report totals by "place of death" rather than "place of residence". Ideally, if data consumers need to determine where community services, programs, and interventions should take place, the targeted areas should be based on where affected individuals live, not necessarily where they die. However, I've observed that in a data set such as this, when the data are entered on an individual basis, some data fields, especially "place of residence" have lower quality, accuracy, and/or completeness. While the data field "place of death" tends to be more accurate based on the manner in which is collected, i.e. from the hospital records, death certificates, etc. 

In fact, over the period of 2012-2017, there are over 130 cases in this data set where the "place of residence" is entered as either "Unknown" or have been left blank. This accounts for about 3% of the data set and since the data are reported at the town level, smaller geographies tend to suffer the most when there are missing data. In addition, this data set includes over 140 cases where the "place of residence" is out of state. When we include towns outside of Connecticut, this makes it difficult to determine exactly what town is reported. If the town is reported without a state or even without a county, it's difficult to determine the exact town reported. 

Interestingly enough, during the data cleaning process, I learned that there is a town called Hamilton, in Madison County, New York, and a township called Hamilton in Madison County, Indiana. In addition, there is a neighborhood called Hancock, in Hillsborough County, Florida, and a town called Hancock, in Hillsborough County, New Hampshire! Now you can see if only a town and county is reported, we can't confidently report the data from these towns, without knowing the state. For clarification, there were only three cases in this data set where, "place of death" was not reported and all places of death were in Connecticut.

Cleaning/wrangling the data: the steps we take in order to identify the types of drugs present at death. Below are screenshots from the raw data and how I arrive at a 'clean' dataset. 

  • If the ImmediateCause field includes the name of a drug, and that drug is not flagged in the raw data (notice the Cocaine column is empty), we must manually adjust the record to flag the corresponding drug.

  • Our data set includes an Other drug type category. Other reports deaths not otherwise attributable to any enumerated substance or to Any Opioid. In some cases, individual death records indicate an identifiable opioid in the Other field (as below) or in the ImmediateCause field. In these cases, we have flagged those deaths as Any Opioid. 

  • Finally, if an individual opioid has been flagged (Fentanyl), we must make sure the Any Opioid column is also flagged.

The Connecticut Data Collaborative provides these aggregations in order to address problems stemming from individual deaths being attributable to multiple substances. We will be releasing a series that examines the opioid crisis in Connecticut.