export default {
    id: '2020-04-30',
    year: 2020,
    month: 4,
    date: 30,
    title: `Developer Story: Seed Script End Game`,
    blog_url: `https://keithvictordawson.medium.com/developer-story-seed-script-end-game-f85e9ecf29a8`,
    image_url: `https://images.unsplash.com/photo-1544383835-bda2bc66a55d`,
    image_caption: `Photo by <a class="text--primary" href="https://unsplash.com/@jankolar">Kolar.io</a> on <a class="text--primary" href="https://unsplash.com/s/photos/database">Unsplash</a>`,
    contents: [
        {
            type: 'text',
            content: [
                {
                    type: 'text',
                    content: `Looking back over the past month since I wrote my `,
                },
                {
                    type: 'internal_link',
                    year: 2020,
                    month: 3,
                    date: 31,
                    content: `previous`,
                },
                {
                    type: 'text',
                    content: ` developer story entry, I am surprised. For one thing, I am surprised by how quickly another month has passed while I have been diligently working on my project. For another, I am amazed by how far I have come in that month. When I began my project in October of last year and found the initial dataset that I would use to fill my project database, I estimated that it would take no more than a couple of months to sift through all of that data, clean it up, and put the clean data into a set of seed scripts with which I could fill the initial version of my database. At the time, I felt that this was an overestimate and that the task might well take much less time to complete. I could not have been more wrong.`,
                },
            ],
        },
        {
            type: 'text',
            content: `The original form of the dataset that I decided to use for my project was one huge block of text from a single web page. There were almost eighty-four thousand lines of text in this dataset, but they could not simply be used as they were in that original form. Many of the lines of text would actually result in multiple entries in my database, based on how I chose to build the schema for my database. In order to translate that massive amount of text into a full collection of data in my database, I knew that I would need to put it through three steps.`,
        },
        {
            type: 'text',
            content: `The first of those steps was to divide the data into more manageable pieces, as I knew that taking on the full set as one whole would require an unacceptable amount of time. The second step would be to go through each of those pieces, line by line, to clean up the data that was there and mold it into a cohesive and consistent form. The third step would be to translate those cleaned data pieces into seed scripts, which would then be easy to work with in a database migration script process.`,
        },
        {
            type: 'text',
            content: `As I mentioned in my previous developer story entry, the first step was easy. By dividing the full dataset based on the first character of each line of text, I was able to create twenty-four pieces. Based on the size of the initial dataset, this resulted in each piece containing roughly three thousand five hundred lines of text, on average. When looking at an individual piece, this was much more manageable. As with any language, however, there is not an even distribution of words beginning with each letter of the alphabet. Some of the data pieces were much larger than the average, while some were much smaller. As a result, the amount of time that it would take for me to complete work on each data piece would vary widely.`,
        },
        {
            type: 'text',
            content: `With the dataset divided, I was able to review and clean up the text rather quickly. In order to help me with the review and cleanup process, I put Google Docs to good use. I found the advanced search and replace functionality provided by Google Docs to be quite useful in identifying and cleaning up problematic data, especially since it allows for the use of regular expressions when searching. In order to reach the final step sooner, I have been focusing most of my attention on reviewing and cleaning up the smaller data pieces. Doing so has also allowed me to identify common problems in the smaller data pieces first so that I know what to look for when dealing with the larger data pieces later. As of the writing of this developer story entry, I have fully reviewed and cleaned up all but the two largest data pieces in my dataset. Every other data piece has been cleaned and is ready to be translated into a seed script.`,
        },
        {
            type: 'text',
            content: `With the majority of the data pieces reviewed and cleaned, I have made good progress on the third and final step. Originally, I had been creating the seed scripts from cleaned data by hand because I was not yet sure what final form the seed scripts would take. But once I had a good enough idea and could identify which parts of the seed script creation process could be handled by automation, I developed a bit of software to take the data in its Google Docs form and directly output seed script code. Once I had this software in my toolkit, I was able to drastically speed up my seed script creation process.`,
        },
        {
            type: 'text',
            content: `Now keep in mind, this software cannot translate absolutely everything correctly in the seed script creation process. I still need to run this software manually, feeding in chunks of each data piece and reviewing the code output for any necessary changes. But this is a small price to pay for improving the speed of my data translation process from about one piece every day or two when completed manually to about three pieces per day with the assistance of the software. And all of the data pieces that I have been translating with the software have been much larger than those that I was translating by hand. As of the writing of this developer story entry, I have finished seed scripts for two-thirds of the data pieces in my dataset.`,
        },
        {
            type: 'text',
            content: `Looking back on the amount of effort that I have put into getting my language dataset in order, the process has been bumpy, but my persistence has paid off nicely as I am now well on my way to fully completing the seed script creation process. If you had asked me at the time of my previous developer story entry how long it would take for me to complete the process, before I had created the automation software to help me translate the cleaned data pieces into seed scripts, I would have guessed another couple of months. Luckily, I have brought that estimate down substantially, and I would guess right now that I will fully complete the seed script creation process within the next week.`,
        },
        {
            type: 'text',
            content: `As I have mentioned previously, once that process has been completed, my mind will be freed up to focus all of my thought, energy, and effort on the applications that I have already created for the system that makes up my project. I will be able to dedicate all of my imagination to dreaming up exciting new directions in which to take my project and to developing new and useful features for those applications. Completing this process will mean that I have met another major milestone that I set for myself way back in October of last year. Please stay tuned for more developer story entries as I put the bulk of the database work for my project behind me and begin to fully focus on the development of the applications that constitute my personal project.`,
        },
    ],
}