Providing statewide education data has been at the core of CT Data's work since we began. Over the last year, we've been working primarily with data from the State Department of Education's EdSight portal. However, many of you have probably realized our datasets are stale.

Although we applaud the resources that the Department of Education has put into gathering and reporting district- and school-level data through EdSight, we have been frustrated in our attempts to efficiently update and expand our education datasets. EdSight contains more education datasets than we previously provided to the public, but it does not contain the aggregations that had given small districts some data instead of suppressions for small cell size year after year. When we requested the datasets in the format in which we had previously received them, we were told the department does not have the resources to fulfill such requests.

To continue providing our users with education data that can be compared over time and across districts through our visualizations, we are compelled to restructure our datasets to use the data that the Department provides through EdSight more efficiently. To do so, we developed a command line tool that addresses these problems directly. We are also releasing the bulk data that we scraped from EdSight as a datapackage on GitHub.

We will be updating and expanding our educational datasets over the next few weeks using the new data. Because of the way data are presented through EdSight, we will not be able to maintain some of the datasets we used to provide, and some of the disaggregations we previously provided will no longer be available. Unfortunately, more data will be suppressed because aggregations (by grade, for example) are not available on EdSight. The education data we used to report will remain accessible but will not be comparable to the new data. The good news is that we will be able to provide more education datasets than we did before.

We will now list each individual disaggregation that is available as a separate dataset. For example, Chronic Absenteeism by English Language Learners will be its own dataset. Our pre-EdSight SDE datasets will be listed under the Education -> Archived subtopic and will still be viewable through our visualization tools so that people can look at past trend data.


Challenges we face

One of the things that we appreciate most about EdSight is the way that data are structured. With a few exceptions, the data are relatively tidy, which is a big step forward for working with data within scripting and programming environments. That said, EdSight's user interface presented a number of challenges for us.

1. No bulk download

While we ultimately publish data in smaller, topical datasets, we prefer to work with bulk data whenever possible. EdSight lacks a mechanism to download "all" of the data, either across the portal or within a topic category. Within some datasets, some combinations of selections can produce a comprehensive version of the dataset, but this is not consistent.

2. Inconsistent interface

Some datasets are available directly from the main menu drop-downs. Others can only be reached by navigating within top-level categories (e.g. Discipline -> Bullying). The site lacks a central list of the datasets and reports currently published, which makes it a challenge to keep track of what is available or to see the full list of datasets in one view.

3. Suboptimal downloaded file structure

If you've ever downloaded a few files from a given dataset, you've probably noticed that they all receive the same name. This may be fine for one or two downloads, but it presents challenges for us as we try to process datasets that require us to combine multiple files. Furthermore, the downloaded CSVs are not immediately machine readable by our scripts. The first few rows contain a mix of descriptive data and data necessary to interpret the remaining rows. Specifically, year is only reported in this descriptive section and is not included as a column variable. This saves a few bytes in terms of file size but makes it a challenge to join multiple files.
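To illustrate the kind of cleanup this requires, here is a minimal Python sketch. It is not the code from our tool. The file layout, the "District" header name, the "2015-16" year format, and the sample values are all made up for the example, and real EdSight downloads may differ. The idea is simply to skip the descriptive rows, pull the year out of them, and attach it to every data row as a column so that files can be stacked.

```python
import csv
import io
import re

# Assumed year format for this sketch, e.g. "2015-16".
YEAR_PATTERN = re.compile(r"\b(\d{4}-\d{2})\b")

# Made-up file contents: descriptive rows, a blank line, then the table.
SAMPLE = """Chronic Absenteeism
School Year: 2015-16

District,Count,Rate
District A,12,4.1
District B,301,13.2
"""


def tidy_rows(text, header_starts_with="District"):
    """Return (header, rows) with the year from the preamble added as a column."""
    lines = list(csv.reader(io.StringIO(text)))
    year = None
    for index, row in enumerate(lines):
        # The first row whose first cell is the expected column name is the header.
        if row and row[0].strip() == header_starts_with:
            header_index = index
            break
        # Anything before the header is descriptive; look for a year in it.
        match = YEAR_PATTERN.search(" ".join(row))
        if match:
            year = match.group(1)
    else:
        raise ValueError("no header row found")
    header = lines[header_index] + ["Year"]
    rows = [row + [year] for row in lines[header_index + 1:] if row]
    return header, rows


def combine(texts):
    """Stack several downloads that share the same columns."""
    header, combined = None, []
    for text in texts:
        header, rows = tidy_rows(text)
        combined.extend(rows)
    return header, combined


header, rows = tidy_rows(SAMPLE)
print(header)
for row in rows:
    print(row)
```

Once year is a column, files from different years can be joined or appended without having to track which file each row came from.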


Our solution

To solve these issues for ourselves (and, we hope, for others), we've developed a command line tool that allows for semi-bulk downloading and exploration of EdSight datasets.


How it works

The tool is fairly simple. It offers a way to list datasets and to display the variables available within a dataset.

Pass in the name of a dataset and the geography you want to retrieve (district or school). The tool will then download the individual CSV files, name them in a more descriptive manner, and store them where you tell it. We also offer an option to download the entire EdSight catalog in one command. However, because this process takes a fairly long time and generates a fair amount of web traffic, we don't recommend using this option. Instead, you can grab the full catalog from our GitHub repository or download it directly.
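As a rough illustration of what "more descriptive" naming can look like, the Python sketch below builds a file name from the dataset, the geography, the year, and any selections used for the download. The function names and the naming scheme are hypothetical and are not necessarily what our tool does.

```python
import re


def slugify(value):
    """Lowercase a label and replace runs of other characters with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", value.lower()).strip("-")


def descriptive_filename(dataset, geography, year, selections=None):
    """Build a file name that records what the download contains.

    Hypothetical scheme: dataset_geography_year_selection-value.csv
    """
    parts = [slugify(dataset), slugify(geography), slugify(year)]
    # Sort the selections so the same choices always give the same name.
    for name, value in sorted((selections or {}).items()):
        parts.append(f"{slugify(name)}-{slugify(value)}")
    return "_".join(parts) + ".csv"


print(descriptive_filename(
    "Chronic Absenteeism",
    "district",
    "2015-16",
    {"Subgroup": "English Language Learners"},
))
# prints chronic-absenteeism_district_2015-16_subgroup-english-language-learners.csv
```

With names like this, files from the same dataset no longer overwrite one another, and the contents of each file can be identified without opening it.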


What we are doing with this tool

First, we are updating and expanding our own data catalog to include the full range of data published on EdSight. This means our users will soon be able to browse and visualize EdSight data on ctdata.org using our interface and visualization tools. For example, you will be able to compare multiple districts within a particular dataset over time.

Second, while our datasets will be extracts of the data available on EdSight, we will be making bulk files available for download. We think of our tools as the starting point for analysis and exploration, and we want our users to be able to take the data into whatever environment they are most comfortable with.

Third, we are releasing our tool as an open-source resource, and we welcome feedback, feature requests, and collaboration. We anticipate making a number of improvements to the tool, but we suspect that our users will find things to address that we have yet to consider.


What else we want to do with this

So far, this tool has focused on solving a specific problem that we faced in processing these data. However, we want to make it broadly useful if possible.

We have a few ideas, such as downloading all datasets filtered to a chosen set of districts or schools, and offering an option to preprocess some of the datasets. We would also welcome suggestions and requests from you. Please let us know if you have any ideas or feedback.