Apache Spark Foundation Course - File based data sources


Hello and welcome back to Learning Journal. I have already talked about loading data from a CSV source file. In this video, we will extend the same idea and explore some other commonly used data sources. We will not only load but also explore the process of writing data to a variety of file formats. We will primarily work with five different types of data files.

  1. Parquet
  2. JSON
  3. ORC
  4. AVRO
  5. XML

In this video, we will restrict ourselves to working with data files. However, this video will give you the core concepts of working with a source system. Once you have these core concepts, you will be able to work with a variety of other third-party data sources. We will also explore a couple of third-party sources in the next video to help you extend your learning to virtually any other Spark source and destination. Great! Let's start.
The first thing to understand is the general structure of the read and write APIs. To read data from a source, you need a reader. That is where we have the DataFrameReader interface. Similarly, to write data, you need a writer. That is where we have the DataFrameWriter interface. These two interfaces are a kind of Spark standard for working with any source system. There are a few exceptions to this rule for appropriate reasons. However, if the source system can offer you a clean row-column structure, you should get these two interfaces. There is one crucial point to note here. I am not talking about reading something which doesn't qualify to have a schema. If the data is structured or semi-structured, where you can define a row-column schema, you should be using a DataFrameReader and a DataFrameWriter. Every standard Spark 2.x connector follows this convention. Whether you want to connect to Cassandra, Redshift, Elasticsearch, or even Salesforce, they all offer you a DataFrameReader and a DataFrameWriter.
When you don't have a schema, or the data source provider doesn't offer a Spark 2.x connector, you will fall back to RDDs, and we will talk about reading RDDs in a later video.

How to read from Spark data sources

We have already used DataFrameReader for CSV. So, you already know the general structure. For a typical reader, there are four methods.

  1. format
  2. option
  3. schema
  4. load

We have already seen these in action for CSV. The format method takes a string to define the data source. It could be one of the following values.

  1. parquet
  2. json
  3. orc
  4. com.databricks.spark.avro
  5. com.databricks.spark.xml

The first three formats are part of the core Spark distribution. However, the other two are still separate packages, and hence if you want to use them, you must manage the dependency yourself. Similarly, for other connectors, you should check their respective documentation.
The next one is the option method. We have already used options for CSV. You can pass a key-value pair to the option method, and you can add as many option methods as you want. Right? The set of valid options and their allowed key-value pairs depends on the connector. We have seen a long list of CSV options, and similarly, you can find the list of supported options for the first three formats (Parquet, JSON, ORC) in the Spark documentation. For other connectors, you need to check their respective documentation, for example:

  1. Databricks Avro connector
  2. Databricks XML connector

The next method is the schema method. We have already learned that we can either infer the schema from the source system or supply it explicitly using this method. There are two alternatives for specifying a custom schema. We have already seen the StructType approach while working with CSV. You also have another, simpler choice: specify a DDL-like string. The sketch below shows both alternatives.
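Here is a minimal sketch of the two alternatives; the column names are made up for illustration.

```scala
import org.apache.spark.sql.types._

// Alternative 1: build a StructType programmatically (as we did for CSV)
val mySchema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType),
  StructField("age", IntegerType)
))

// Alternative 2: a DDL-like string describing the same columns
val mySchemaDDL = "id INT, name STRING, age INT"
```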

Finally, you have the load method. In the case of a file-based data source, you can pass a directory location or a file name to the load method. For other sources such as JDBC, there is no file, so you would call it without any argument.
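Putting the four methods together, a typical read looks like the sketch below. The source format, the options, and the directory path are placeholders for illustration.

```scala
// A generic read: format + options + schema + load
val df = spark.read
  .format("csv")
  .option("header", "true")
  .schema("id INT, name STRING, age INT")   // DDL-like string (Spark 2.3+); you could equally pass the StructType
  .load("/path/to/input-dir")
```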

How to handle malformed records

So far so good. But what happens if you get a malformed record? That's a natural condition. When you are reading hundreds of thousands of rows, some of them may not conform to the schema. Those are incorrect records. What do you want to do with those?
Spark DataFrameReader allows you to set an option called mode. We have seen that as well in the CSV example. You can set it to one of the following three values.

  1. permissive
  2. dropMalformed
  3. failFast

The default value is permissive. If you set the mode to permissive and encounter a malformed record, the DataFrameReader will set all the column values to null and push the entire row into a string column called _corrupt_record. This corrupt record column allows you to dump those records into a separate file. You can investigate and fix those records at a later stage. The dropMalformed option will quietly drop the malformed records. You won't even notice that some rows were not accurate.
And the failFast option raises an exception on the first malformed record.
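Here is a sketch of the three modes with a JSON source; the path is a placeholder.

```scala
// permissive (default): bad rows get null columns, raw text lands in _corrupt_record
val permissiveDF = spark.read
  .format("json")
  .option("mode", "PERMISSIVE")
  .load("/path/to/json-dir")

// dropMalformed: bad rows are silently discarded
val droppedDF = spark.read
  .format("json")
  .option("mode", "DROPMALFORMED")
  .load("/path/to/json-dir")

// failFast: the first bad row raises an exception
val strictDF = spark.read
  .format("json")
  .option("mode", "FAILFAST")
  .load("/path/to/json-dir")
```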
Great! I just want to make one last point about the readers.
While working with CSV, we used the standard DataFrameReader methods, and we also used the simple csv method. The csv method is a shortcut method on the DataFrameReader. While learning about other connectors, you might also find similar shortcut methods for those connectors. You can use them if you want to. However, I recommend that you follow the standard technique, as shown below. That makes your code more readable for a larger team.
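For example, the two reads below are equivalent; the path is a placeholder.

```scala
// Shortcut method
val df1 = spark.read.option("header", "true").csv("/path/to/csv-dir")

// Standard form (same result, consistent across all sources)
val df2 = spark.read
  .format("csv")
  .option("header", "true")
  .load("/path/to/csv-dir")
```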


How to write Spark data frames

Let's quickly cover the DataFrameWriter and then we will look at some examples.
The concept of writing data back to the source system is fundamentally the same as reading data. You need a writer, and hence we have a DataFrameWriter interface. For a typical writer, there are four methods.

  1. format
  2. option
  3. mode
  4. save

The format method and the option method work just like their DataFrameReader counterparts. You need to check the documentation for the available options.
The save method is the writer's counterpart of the load method. It takes the output path and writes the data at the specified location.

How to handle the file already exists condition

However, while working with a file-based destination, you may encounter a file already exists situation. To handle that scenario, we have a mode method that takes one of the following options.

  1. append
  2. overwrite
  3. errorIfExists
  4. ignore

The first two values are self-explanatory. The third option will raise an exception, and the last option will keep quiet. I mean, if the file already exists, Spark will simply skip the write without raising any error. The sketch below puts the writer methods together.
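This is a minimal sketch; df is any data frame, and the output path is a placeholder.

```scala
// A generic write: format + mode + save
df.write
  .format("parquet")
  .mode("overwrite")            // or "append", "errorIfExists", "ignore"
  .save("/path/to/output-dir")
```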

Additional Spark Data frame write options

If you are working with a file-based destination system, you also have three additional methods.

  1. partitionBy
  2. bucketBy
  3. sortBy

The partitionBy method allows you to partition your output file, and the bucketBy method allows you to create a finite number of data buckets. Both methods follow the Hive partitioning and bucketing principles. I hope you are already familiar with the partitioning and bucketing concepts. If not, write a comment below, and I will share some relevant content.
These extra methods may not be available with other connectors, so you need to check their respective documentation. Here is a quick sketch of how they are used with a file-based destination.
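The column names country and id are hypothetical, and note that bucketBy and sortBy work with saveAsTable rather than a plain save.

```scala
// partitionBy: one sub-directory per distinct value of the partition column
df.write
  .format("parquet")
  .mode("overwrite")
  .partitionBy("country")
  .save("/path/to/partitioned-output")

// bucketBy and sortBy: write a fixed number of buckets into a managed table
df.write
  .format("parquet")
  .mode("overwrite")
  .bucketBy(8, "id")
  .sortBy("id")
  .saveAsTable("my_bucketed_table")
```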


Spark Data Source Examples

Great! That's all about theory. All this knowledge is useless until you develop some skills around it. And you can convert your knowledge into skills by writing some code. So, let's create a simple example and write some code.
I am going to do the following things.

  1. Read CSV -> Write Parquet
  2. Read Parquet -> Write JSON
  3. Read JSON -> Write ORC
  4. Read ORC -> Write XML
  5. Read XML -> Write AVRO
  6. Read AVRO -> Write CSV

By doing these simple exercises, we will be able to cover all the file formats that I talked about in this lesson. So, let's start.

Working with CSV in Apache Spark

Here is the code to read a CSV file and write it out in Parquet format.
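A minimal sketch of that code, assuming placeholder file paths (adjust them for your environment):

```scala
// Read the CSV source (header and schema inference as in the earlier video)
val csvDF = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/path/to/input.csv")

// Write the data frame out in Parquet format
csvDF.write
  .format("parquet")
  .mode("overwrite")
  .save("/path/to/parquet-output")
```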

I have already explained CSV reading in an earlier video. So, let's focus on the writing part. We start by calling the write method on a data frame. The write method returns a DataFrameWriter object, and then everything else is simple. We call the necessary DataFrameWriter methods. I hope you are already familiar with the Parquet file format and its advantages. Since Parquet is a well-defined file format, we don't have as many options as we had for CSV. In fact, Parquet is the default file format for Apache Spark data frames. If you don't specify a format, the writer will assume Parquet. We are setting the mode to overwrite. So, if you already have some data at that location, you will lose it, and Spark will refresh the directory with the new files.


Working with JSON in Apache Spark

Let's move on to the next example. This time, we read Parquet and write JSON.
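A minimal sketch of that step, again with placeholder paths:

```scala
// Read the Parquet output from the previous step
val parquetDF = spark.read
  .format("parquet")
  .load("/path/to/parquet-output")

// Write it back out as JSON
parquetDF.write
  .format("json")
  .mode("overwrite")
  .save("/path/to/json-output")
```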

The code is straightforward. In fact, it follows the same structure as the earlier example. Let me execute the code. Here is my JSON file. Pick one record and compare it with your original CSV. In these examples, we simply read data in one format and write it back in another format. We are not applying any transformations. However, in a real project, you will read data from a source, apply a series of transformations, and then save it back in the same or a new format. And you can do that easily as long as you have a data frame as your final result. We are doing the same thing here, saving a data frame. Right?

Working with ORC in Apache Spark

Here is an example to read JSON and write it back in ORC format. There is nothing new for me to explain here.
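A minimal sketch, with placeholder paths:

```scala
// Read the JSON output from the previous step
val jsonDF = spark.read
  .format("json")
  .load("/path/to/json-output")

// Write it back out as ORC
jsonDF.write
  .format("orc")
  .mode("overwrite")
  .save("/path/to/orc-output")
```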

Working with XML in Apache Spark

The next example is to read from ORC and write it to XML. The XML connector is not part of the Spark distribution. It is an open-source connector that Databricks has created and certified. Similarly, our next example will use AVRO. That's again a Databricks-certified connector, but it is not available as part of the Spark distribution. You can get the documentation and other details about these connectors on GitHub.
To use these connectors in the Spark Shell, you must include their dependencies when you start the shell. The method is given in their respective documentation. Start the Spark Shell using a command like the following.
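The exact Maven coordinates and versions depend on your Spark and Scala versions, so treat the ones below as examples and check the connector documentation.

```bash
spark-shell --packages com.databricks:spark-xml_2.11:0.4.1,com.databricks:spark-avro_2.11:4.0.0
```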

This command specifies the group ID, artifact ID, and the version. That's it. The Spark dependency manager will pull the required JARs from the Maven repository and include them in your classpath.
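With the connector on the classpath, the ORC-to-XML step might look like the sketch below; the paths and the XML tag names (the rootTag and rowTag values) are placeholders.

```scala
// Read the ORC output from the previous step
val orcDF = spark.read
  .format("orc")
  .load("/path/to/orc-output")

// Write it out as XML using the Databricks XML connector
orcDF.write
  .format("com.databricks.spark.xml")
  .option("rootTag", "records")
  .option("rowTag", "record")
  .mode("overwrite")
  .save("/path/to/xml-output")
```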

Working with AVRO in Apache Spark

The code example below reads an XML file into an Apache Spark data frame and writes it back to an AVRO file.
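A minimal sketch of that step, with placeholder paths and the same hypothetical rowTag as above:

```scala
// Read the XML output from the previous step
val xmlDF = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "record")
  .load("/path/to/xml-output")

// Write it out in AVRO format using the Databricks AVRO connector
xmlDF.write
  .format("com.databricks.spark.avro")
  .mode("overwrite")
  .save("/path/to/avro-output")
```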

Thank you for watching Learning Journal. Keep learning and keep growing.

