Table Prep: Data loads and time zones

My Autonomous Database in Oracle Cloud Infrastructure

With my Autonomous Database up and running, I needed to find some data to upload. After some quiet reflection, I decided on three datasets that should be plenty of fun.

Restaurants in Wake County, North Carolina. Wake County is where I, Chris Hoina, live.

A bit of context: I currently live in Wake County, North Carolina, and it turns out we’ve been very progressive about providing open, accessible data to our citizens. I really wanted to do something with food, so I was pleasantly surprised to find the following three datasets (among the nearly 300 available):

  1. Restaurants in Wake County
  2. Restaurant Food Inspections/Grades
  3. Restaurant Food Inspection Violation History

Upon reading the documentation, I discovered that these datasets can all be connected through their HSISID* permit ID fields. That should come in handy later, as it could make for some very interesting (if not amusing) revelations.

*I’m not sure of the acronym, but the documentation defines the HSISID as a, “State code identifying the restaurant (also the primary key to identify the restaurant).”
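Down the road, a join across the three tables might look something like this. Fair warning: the table and column names below are my own placeholders, not the actual schema; only the HSISID field comes from the documentation.

```sql
-- Hypothetical table and column names; only HSISID comes from the docs.
SELECT r.name,
       i.score,
       v.description
  FROM restaurants       r
  JOIN food_inspections  i ON i.hsisid = r.hsisid
  JOIN violation_history v ON v.hsisid = r.hsisid;
```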

To the data loading

I’m sticking with the Oracle Database Actions* web-based interface for now.

In the Database Actions Launchpad, there are several categories of actions available to users, including development, data loading, monitoring, and administration.

*Oracle Database Actions was previously known as Oracle SQL Developer Web.

And from what I learned in the Autonomous Database workshop, there are two ways to load data into your database (there may be other methods; mea culpa if so):

  1. SQL worksheet (found in the Development section of your launchpad)
  2. Data Load Tool (found in the Data Tools section)

I opted for the SQL worksheet method, since I need to practice and learn it regardless.

Obstacles

After choosing my method of data loading, I encountered some obstacles. With a complete lack of regard for best practices, I proceeded to load all three datasets into my Autonomous Database, only to discover two glaring issues.

Unfortunately, in my restaurant “violations” table, I encountered two rows that failed to load.

I received two ORA-12899 (“value too large for column”) errors on these rows. Turns out, I needed to allocate more byte space in the offending column.

These values were loaded into a VARCHAR2 column, which by default allows a maximum of 4,000 bytes. Since the values in these two rows exceeded 4,000 bytes in that specific column, they failed to load. Fortunately, it was easy to increase the limit to 10,000 bytes.

Apparently, I could have increased it to 32,767 bytes, but that seemed overboard. I also learned that the maximum VARCHAR2 string size is governed by the MAX_STRING_SIZE initialization parameter, which can be set to STANDARD or EXTENDED (i.e., 4,000 bytes vs. 32,767 bytes). I’m assuming the Autonomous Database is set to EXTENDED by default, since I was able to widen this column and re-upload with zero failed rows. You can read up on the VARCHAR2 datatype here.
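The fix itself is a one-line DDL statement. Assuming the table and column were named as below (they weren’t necessarily; these are placeholders), it looks roughly like this:

```sql
-- Hypothetical table/column names. VARCHAR2 sizes above 4,000 bytes
-- require the database's MAX_STRING_SIZE parameter to be EXTENDED.
ALTER TABLE violations MODIFY (comments VARCHAR2(10000));
```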

The second obstacle took me about a day (on and off) to figure out. The dates in all three of these tables were constructed like this:

The date/time field in this .CSV file was unique. Turns out, it was an Esri field type that maps to Oracle’s TIMESTAMP datatype.

At first glance, I thought I was looking at a time zone offset. Times were either 04:00:00+00 or 05:00:00+00, possibly due to daylight saving time. That seemed like a reasonable assumption, since the early-November to mid-March dates were an hour later than the mid-March to early-November dates (i.e., UTC−5 in winter, UTC−4 the rest of the year).
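That assumption is easy to sanity-check with Python’s standard library, which knows the Eastern time zone’s offsets:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

eastern = ZoneInfo("America/New_York")

# Midnight, local Eastern time, in January (EST) and July (EDT)
winter = datetime(2022, 1, 15, tzinfo=eastern)
summer = datetime(2022, 7, 15, tzinfo=eastern)

print(winter.utcoffset())  # -1 day, 19:00:00  (UTC-5, standard time)
print(summer.utcoffset())  # -1 day, 20:00:00  (UTC-4, daylight saving time)
```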

My first thought was to modify the affected columns with something like this parameter:

TIMESTAMP [(fractional_seconds_precision)] WITH TIME ZONE

or this…

TIMESTAMP '1997-01-31 09:26:56.66 +02:00'

But I’d never seen this date/time stamp before (the ‘esriFieldTypeDate’), so I had to investigate before I did anything (also, I had no idea how to update a column with a parameter like this, so it would have taken additional time).
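For what it’s worth, since Oracle won’t change the datatype of a populated column in place, one common pattern for adding a time zone is to stage a new column with the FROM_TZ function and swap it in. A sketch, with all table and column names hypothetical:

```sql
-- Hypothetical names. Oracle won't convert a populated TIMESTAMP column
-- directly, so: add a new column, copy with FROM_TZ, drop, and rename.
ALTER TABLE violations ADD (inspect_date_tz TIMESTAMP WITH TIME ZONE);

UPDATE violations
   SET inspect_date_tz = FROM_TZ(inspect_date, 'UTC');

ALTER TABLE violations DROP COLUMN inspect_date;
ALTER TABLE violations RENAME COLUMN inspect_date_tz TO inspect_date;
```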

Turns out, this data originates from ArcGIS Online, and this field data type maps directly to Oracle’s TIMESTAMP datatype. Seen here in the documentation:

In the ArcGIS documentation, it looks like the ‘esriFieldTypeDate’ maps to Oracle’s ‘TIMESTAMP’ datatype.

I later tinkered with an API Explorer on the Wake County datasets site and reviewed a sample JSON response for this data set. The response also confirmed the field type:

Wake County has an API Explorer on their datasets site, where you can run sample queries and review the JSON that is returned.

It’s funny, because after all this hunting, it looks like I won’t be able to include time zones anyway. That’s somewhat inconsequential, since I’m only interested in Wake County, but for some that could be unnerving.

I wish I could say I came up with a cool Python script to automate the cleaning of these columns (I would have had to do this for all three of my .CSV files), but I did not. It would have taken me an afternoon or longer, since I haven’t toyed with the csv module (in the Python standard library), pandas, or NumPy in quite some time. In the end (and it pains me to say this; it’s embarrassing, really), I used the Find & Replace feature in Numbers to remove the time/time zone portion of the dates.
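In hindsight, a minimal version using only the standard library’s csv module wouldn’t have been much code. A sketch (the file and column names below are hypothetical, not the real Wake County headers):

```python
import csv

def strip_times(in_path: str, out_path: str, date_col: str) -> None:
    """Rewrite a .CSV, keeping only the date portion of one column."""
    with open(in_path, newline="") as src, \
         open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # e.g. "2022-03-15 04:00:00+00" -> "2022-03-15"
            row[date_col] = row[date_col].split(" ")[0]
            writer.writerow(row)

# Hypothetical usage:
# strip_times("violations.csv", "violations_clean.csv", "INSPECTDATE")
```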

I’m pathetic, I know. But the idea of creating a dedicated Python script does seem like a fun weekend project. Who knows, maybe I’ll post to GitHub. Regardless, my Autonomous Database is clean and primed for ORDS.

In closing

My advice is this:

  • Always inspect your data before you do anything major.
  • Always make sure you understand the inputs and the resultant outputs of the data-loading process.
  • Always refer to the documentation first, so you can understand how the data is recorded, coded, and displayed.

Oh, and it wouldn’t hurt to think about how you might use this data in the future. I’m guessing that could help you avoid a lot of rework (i.e. technical debt).

Take me as an example. Since I’m focused on ORDS and REST enabling my database, I may want to consider things like:

  • What am I going to expose (via REST APIs)? or
  • What fields might the presentation layer of my application include?

In my case, a lot of this data will lie dormant, as I might choose to focus on only a few key fields. Of course, anything I create will be simple and basic. But that might not be the case for you.


I’m off to REST…enable this database now, but in the meantime if you want to learn more about ORDS, check out:

