Mastering Data Formats: A Quick Guide to Reading CSV, JSON, XML, HTML, SQL, and Parquet Files

Ankit Gupta
4 min readJul 18, 2024

--

Photo by Samantha Hurley from Burst

In the realm of data analysis, information comes in a variety of shapes and sizes, each requiring specialized processing to ensure efficient storage, transmission, and analysis. Fortunately, there is a wide range of standardized file formats designed to meet these objectives. Whether you’re working with tabular data, hierarchical structures, web content, relational databases, or big data, understanding these formats and how to read them is critical for any data professional.

Exploring Common Data Formats and How to Read Them

1. CSV (Comma-Separated Values)

CSV files are perhaps the simplest and most widely used format for storing tabular data. Each line in a CSV file represents a row, and fields within each row are separated by commas (or other delimiters).
Example:

---usual structure of CSV data
Name,Age,City
John,28,New York
Alice,25,Los Angeles
Bob,30,Chicago
import csv

with open('data.csv', newline='') as csvfile:
reader = csv.reader(csvfile)
for row in reader:
print(', '.join(row))

2. JSON (JavaScript Object Notation)

JSON is a lightweight data interchange format used to structure data in a readable format. It supports nested data structures, making it ideal for representing complex hierarchical data.

JSON is prevalent in web development for transmitting data between a server and web application. It’s also used for configuration files, APIs, and NoSQL databases.
Example:

---usual structure of JSON data
{
"employees": [
{"firstName": "John", "lastName": "Doe", "age": 30},
{"firstName": "Anna", "lastName": "Smith", "age": 25},
{"firstName": "Peter", "lastName": "Jones", "age": 45}
]
}
import json

with open('data.json') as jsonfile:
data = json.load(jsonfile)
print(data)

3. XML (eXtensible Markup Language)

XML is a markup language designed to store and transport data with a focus on simplicity and readability. It uses tags to define data elements and attributes.

XML is widely used in web services, configuration files, data interchange between different systems, and as a format for storing semi-structured data.
Example:

---usual structure of XML data
<data>
<person>
<name>John</name>
<age>28</age>
<city>New York</city>
</person>
<person>
<name>Alice</name>
<age>25</age>
<city>Los Angeles</city>
</person>
<person>
<name>Bob</name>
<age>30</age>
<city>Chicago</city>
</person>
</data>
import xml.etree.ElementTree as ET

tree = ET.parse('data.xml')
root = tree.getroot()
for person in root.findall('person'):
name = person.find('name').text
age = person.find('age').text
city = person.find('city').text
print(f"Name: {name}, Age: {age}, City: {city}")

4. HTML (HyperText Markup Language)

HTML is the standard markup language for creating web pages and web applications. It structures content using tags, defining elements such as headings, paragraphs, and lists.
Example:

---usual structure of HTML data
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Sample HTML Page</title>
</head>
<body>
<h1>Hello, World!</h1>
<p>This is a sample HTML page.</p>
</body>
</html>
from bs4 import BeautifulSoup

with open('index.html') as html_file:
soup = BeautifulSoup(html_file, 'html.parser')
print(soup.prettify())

5. SQL (Structured Query Language)

SQL is a domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS).
Example:

---usual structure of SQL data
CREATE TABLE employees (
id INTEGER PRIMARY KEY,
name TEXT NOT NULL,
age INTEGER,
city TEXT
);

INSERT INTO employees (name, age, city) VALUES ('John', 28, 'New York');
INSERT INTO employees (name, age, city) VALUES ('Alice', 25, 'Los Angeles');
INSERT INTO employees (name, age, city) VALUES ('Bob', 30, 'Chicago');
import sqlite3

conn = sqlite3.connect('example.db')
cursor = conn.cursor()

cursor.execute('SELECT * FROM employees')
rows = cursor.fetchall()
for row in rows:
print(row)

conn.close()

6. Parquet

Parquet is a columnar storage file format optimized for use with big data processing frameworks. It is efficient for both storage and processing of large datasets.

Parquet files are commonly used in big data environments such as Apache Hadoop, Apache Spark, and other data processing frameworks. They are useful for analytical inquiries and data warehousing because of their efficient storage and retrieval capacities.
Example:

'''Parquet data itself isn't human-readable like CSV or JSON. It's a binary
format designed for efficient storage and retrieval by computers. Since we
can't directly view the binary data, the following is just a representation
for understanding.'''

data.parquet


| Name | Age | City |
|-------|-----|------------|
| John | 28 | New York |
| Alice | 25 | Los Angeles|
| Bob | 30 | Chicago |
import pandas as pd

df = pd.read_parquet('data.parquet')
print(df)

Finally, learning the nuances of these basic file formats — CSV, JSON, XML, HTML, SQL, and Parquet — enables data professionals to confidently and proficiently navigate the diverse world of data analysis. Understanding how to read and use these formats is critical when working with structured tabular data, parsing hierarchical information, managing web content, querying relational databases, or processing large datasets.

Thanks for your time!

--

--

Ankit Gupta
Ankit Gupta

No responses yet