We spoke with Craig Statchuk, Big Data Architect at IBM, to learn more about the responsibilities of data architects. Below, Craig discusses the pros and cons of his job, how the data architect position has changed over time, and his advice for students interested in becoming data architects.
Q: WHAT ARE THE TOP PROS AND CONS OF YOUR JOB?
A: The good part is you start most days in the new big data world. This includes everything within the role of a big data architect – someone who fulfills the needs of the entire enterprise beyond IT. In effect, this role is about taking care of more users in more places. Therefore, the pro is that most days you’ll start with a clean slate. You may not know what the day holds for you, but by lunchtime, you’ll have a long list of things to work on, to create, and hopefully resolve in a short period of time. There’s a lot of value placed on immediate results. Instead of “Let’s come up with a great design or a great long-term vision,” it’s more like “I want to see the answers now.”
The con is that it doesn’t quite work that way. You have to keep the long-term vision in place because you’ll see more and more of the same kinds of questions and the same requests. So the architecture that allows you to respond quickly to a wide variety of questions will serve you very well.
There’s a tendency to do things quickly and easily, but the truth is you need time to think and really plan, and time is a scarce resource in a typical workday. So finding time to think about what you’ve done right and how to move forward is really the secret to doing the job well. But right now, it’s the push and pull between the two that makes the job difficult.
Q: IS IT CHALLENGING TRYING TO BALANCE SHORT-TERM AND LONG-TERM NEEDS?
A: That’s exactly it. Everything is short term. It’s the “I need it yesterday” mentality. The problem with this is that it leaves little room to focus on quality or other issues, such as governance or the idea that you need to serve not only your users but also the business. Those two are always in opposition because the users want things now and the business wants things done right. You have to balance the two.
Q: WHAT KIND OF IMPACT DO DATA ARCHITECTS HAVE ON THE SUCCESS OF IBM?
A: The role is changing, and it’s growing quickly. A data scientist 10 years ago built the data warehouses and conducted the analysis in a constrained way. Nowadays, we’re seeing less demand for that, although it still exists in the office of finance, for example, since finance can easily categorize and quantify values according to accounting standards. The rest of the business wants the same kind of analysis so they can look at their data in a variety of ways, but they don’t necessarily have the rigidity or the structure necessary for that.
So what we’re seeing is the classic hype cycle with greater demand for new ways of looking at data. The truth is it’s not going to pan out as well as we hope. There’s going to be some disillusionment or dissatisfaction with the initial results. But at the end of the day, or let’s say at the end of two years, you will have an organization that is agile, able to answer more questions about the business and its customers, and more knowledgeable than it is today about what can be accomplished more quickly and accurately.
More data actually doesn’t make us smarter if we don’t have the ability to consume it. In some ways, it actually makes us less knowledgeable.
For instance, if you have more data, relatively speaking, but you lack the ability to process and understand it, then you know less than you used to. The way to combat that is to have systems that can adapt to the new data, understand and categorize it, and deliver it to more users more quickly than before. That lets you turn the tide against big data. Without the proper resources in place, big data can result in significant confusion, but if it’s well-organized and well-provisioned, it can be a source of greater understanding.
So you have to balance that every day, which involves figuring out how to make the data more reusable.
Q: WHICH SKILLS OR PROGRAMMING LANGUAGES DO YOU MOST FREQUENTLY USE IN YOUR WORK, AND WHY?
A: Traditionally, we came from the world of Java and structured, traditional languages such as that. This was the language of the server, and we could use it with minimal modifications on the browser; so heritage played a big part in moving us in that direction. But nowadays, we’re seeing movement towards more flexible languages that are more data-oriented and more statistically oriented. My personal favorite is Python because it lets me be a computer scientist and have access to the statistics and other analytics that I need. Other people look at using both SPSS and languages such as R on a regular basis because they provide strong statistical packages, and the programming is often much easier and more accessible.
We’re moving away from the traditional languages towards these more ad hoc languages. People are very happy with that. At the end of the day though, it’s still based on the kind of application and what your needs are. We’re even seeing a huge influx in things like Node.js, where people say, “Hey, I want to bring my JavaScript and my Java skills back to the server.” We now have more languages than ever before, and this diversity affords us greater flexibility.
Q: HAS YOUR ROLE CHANGED OVER TIME? HOW DO YOU SEE IT EVOLVING MOVING FORWARD?
A: I graduated with a mathematics degree in 1983 from the University of Waterloo. At that time, we were very much into data structures, programming, and databases. We were taught things like first normal form and third normal form. This was standard for a good 20 years. We did structured programming and dealt with structured data. We worked with databases in a way that made users happy, and we answered certain types of questions very well. As we move forward, however, that data structure hasn’t been serving us as well.
For instance, we weren’t always able to understand just how quickly data was changing, and it took us a while to realize that our way of processing could no longer keep up. So what we’ve arrived at is something that I can only call the new normal form. It represents data that’s just good enough and clean enough but not perfect. This is different from the past way of doing it, in which one piece of data points to other pieces of data, which then would lead us back to our understanding or something related to different pieces of data. Today, we have plenty of look-up tables and other things that help serve the business. The problem with those traditional data structures was that they didn’t allow us to answer enough questions.
The domain of questions that they could answer was really limited. But now we’ve come full circle with these giant spreadsheets. They represent rows and columns of data with lots of gaps, many inconsistencies, and lots and lots of columns. The new tools enable us to build good queries and build data that’s even more reliable than the stuff we used to ETL and put into those structured forms that I already mentioned.
The new solution is to take data and make it as reusable and as accurate as possible without sacrificing flexibility. That becomes a new role with a new focus. Now, I have to ask myself, “How do I produce data with maximum reusability within the company while also making it as accurate and as important as possible for the part of the business responsible for the systems of record that run the business?” You still have to service those, but now we have to service what we call “systems of engagement,” which is how we understand our customers, our employees, and even the products we build.
Q: WHAT KIND OF PERSON MAKES THE BEST DATA ARCHITECT?
A: Nowadays, data architects come with many different skills and backgrounds. For instance, unlike 20 years ago, a pure data or computer science background may not be as helpful. The new skills are to understand the needs of the user so that you can build data and systems that will solve their problems now and in the future. We have to become proactive, and I liken the job to that of an inventor. In other words, we have to invent the solutions that users are going to ask for, not necessarily tomorrow but six months from now or even two years from now. That requires someone who is both innovative and able to focus on the task today but also someone who can look into the future and say, “Hey, here’s what they’re likely going to ask about in two years. How do I create my data? How do I create my business to serve me better down the road?”
We don’t want to be chasing wild geese, but we want to be able to predict and do the kind of processing that users are going to need down the road. That’s a difficult job to do. This opens up opportunities for skilled candidates with a variety of backgrounds. For instance, you could have an art or a business degree, and then you can come into the technical side of the business. In fact, that may be the best possible way to get the broad understanding of the business and then the ability to actually execute it.
Q: WHAT ADVICE WOULD YOU OFFER STUDENTS PREPARING FOR A POSITION AS A DATA ARCHITECT?
A: So I think the best piece of advice I can give is to become an expert. Become the absolute best you can be at a particular field of interest. I don’t care if that’s accounting, psychology, or data management. In five years, your job will be totally different than it is today. And you’re going to have to learn new skills all over again. The only way to survive is to anticipate that you’ll have to become an expert in a new set of skills in order to meet future demands. I need to do that every few years in my career. You have to get used to it, and you have to get good at it.
The ability to turn your career towards the next big thing, whether it’s Hadoop, Spark or maybe data preparation for a line of business, is essential. You’re going to have to understand that, and you’re going to have to help the people who need to do these functions by being the expert they can trust.
Q: DO YOU HAVE ANY FINAL THOUGHTS OR FEEDBACK THAT YOU WOULD LIKE TO SHARE WITH STUDENTS?
A: I think the data field is exploding, but our ability to process it is falling behind. The solution for the industry is scalability. However, I don’t think it’s in the way that we traditionally think about throwing more hardware at the problem. We need the ability to share and collaborate more so that everyone in a business participates. This allows us to divide the efforts and move in a common direction.
But I look at data science today, and I see we have individual people doing the entire analysis from start to finish: gathering the data, cleaning it, doing the analytics, creating the visualization, and presenting the results. The problem standing in the way of scalability is that, for each step along the way in that data science activity, all that knowledge is lost as soon as I complete my task. We need to move the industry towards sharing the work and sharing the role so that one person produces a reusable data set, the next person produces reusable analytics on top of it, and then we can all present the results more quickly and accurately.
I think this funnel is going to eventually constrict to the point that we will have to do something better. However, hiring more data scientists isn’t necessarily the answer. Creating more processing power or even Spark 2, which gives us more processing power than we ever imagined, isn’t part of the solution, either. Rather, it involves getting teams of people to move each processing step forward, and then reusing that work so we don’t keep reinventing the wheel and doing everything from the ground up.
We used to say that, in business analytics, you had a single version of the truth. Now we have to move towards the best-supported, most acceptable version of the truth, which allows us to get to the truth more quickly and easily. As a team, we are providing the data and then sharing the results so that we can all reuse them and turn them into answers for the benefit of the business.