Frequently Asked DataStage Interview Questions

What are environment variables and how to initialize them?

Environment variables are predefined variables that we can use while creating a DataStage job. We can create or declare these variables in the DataStage Administrator and set their properties while designing the job. DataStage environment variables can be set at three levels, from lowest to highest precedence: DataStage instance level, DataStage project level, and DataStage job level.
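The precedence between those three levels can be thought of as overriding lookups: a job-level setting wins over a project-level one, which wins over an instance-level one. A minimal Python sketch of that resolution order (the variable names and paths below are illustrative, not real DataStage defaults):

```python
# Toy model of DataStage environment-variable precedence:
# job-level overrides project-level, which overrides instance-level.

def resolve_env(instance: dict, project: dict, job: dict) -> dict:
    """Merge the three levels, lowest precedence first, so later levels win."""
    resolved = {}
    for level in (instance, project, job):
        resolved.update(level)
    return resolved

instance = {"APT_CONFIG_FILE": "/opt/IBM/default.apt", "TMPDIR": "/tmp"}
project  = {"APT_CONFIG_FILE": "/etc/proj/2node.apt"}
job      = {"APT_CONFIG_FILE": "/etc/proj/4node.apt"}

env = resolve_env(instance, project, job)
print(env["APT_CONFIG_FILE"])  # the job-level value wins
```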

What is a data file and a descriptor file?

The data file contains the actual data, while the descriptor file contains the description of that data (its metadata). The descriptor file also keeps a copy of the configuration file in effect when the dataset was created, which is how the dataset preserves its partitioning.

How do you remove duplicate values in DataStage? 

We can use the Remove Duplicates stage to eliminate duplicates.

We can also use the Sort stage to remove duplicates. It has a property called ‘Allow Duplicates’; if this property is set to false, the sorted output will not contain duplicate values.
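Conceptually, both stages keep only one row per key value. A small Python sketch of that key-based behaviour (the column names and the "retain first" rule are illustrative):

```python
# Keep the first row seen for each key value, as a Remove Duplicates
# stage conceptually does when retaining the first duplicate.

def remove_duplicates(rows, key):
    seen = set()
    out = []
    for row in rows:
        k = row[key]
        if k not in seen:
            seen.add(k)
            out.append(row)
    return out

rows = [
    {"cust_id": 1, "city": "Pune"},
    {"cust_id": 2, "city": "Delhi"},
    {"cust_id": 1, "city": "Mumbai"},  # duplicate key, dropped
]
print(remove_duplicates(rows, "cust_id"))
```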

What is the use of ICONV() and OCONV() functions?

These functions convert values from one format to another, e.g. conversions of Roman numerals, times, dates, ASCII codes, etc.

ICONV() converts data from an external format into the internal format the system stores.

OCONV() converts data from the internal format into an external format that users can read.
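For dates, for example, ICONV() turns an external date string into the internal day count (internal date 0 is 31 December 1967) and OCONV() turns it back. A Python sketch of that round trip, modelling the BASIC ‘D4/’ conversion code for MM/DD/YYYY (the function names here are ours, not DataStage's):

```python
from datetime import date, timedelta

EPOCH = date(1967, 12, 31)  # internal date 0 in DataStage BASIC

def iconv_date(external: str) -> int:
    """Like ICONV(external, "D4/"): MM/DD/YYYY string -> internal day count."""
    m, d, y = (int(part) for part in external.split("/"))
    return (date(y, m, d) - EPOCH).days

def oconv_date(internal: int) -> str:
    """Like OCONV(internal, "D4/"): internal day count -> MM/DD/YYYY string."""
    d = EPOCH + timedelta(days=internal)
    return f"{d.month:02d}/{d.day:02d}/{d.year}"

print(iconv_date("12/31/1967"))  # 0
print(oconv_date(1))             # 01/01/1968
```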

What are Datastage Triggers?

DataStage triggers are used to control job activities in job sequences. In a job sequence, stages are called ‘activities’ and links are called ‘triggers’.

Triggers are of 3 types:

1. Conditional

2. Unconditional

3. Otherwise
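As a toy model (the activity names and status strings below are made up), the three trigger types can be thought of as rules on the links leaving an activity: an unconditional trigger always fires, a conditional trigger fires when its expression matches the activity's result, and an otherwise trigger fires only when no conditional trigger did:

```python
# Toy model of sequencer triggers deciding which downstream activities run.

def fire_triggers(status: str, triggers: list) -> list:
    """triggers: list of (kind, condition, target) where kind is
    'unconditional', 'conditional', or 'otherwise'."""
    fired = []
    any_conditional = False
    for kind, condition, target in triggers:
        if kind == "unconditional":
            fired.append(target)
        elif kind == "conditional" and status == condition:
            fired.append(target)
            any_conditional = True
    if not any_conditional:  # otherwise triggers are the fallback
        fired += [t for kind, _, t in triggers if kind == "otherwise"]
    return fired

triggers = [
    ("conditional", "OK", "load_job"),
    ("conditional", "Failed", "notify_job"),
    ("otherwise", None, "cleanup_job"),
]
print(fire_triggers("OK", triggers))       # ['load_job']
print(fire_triggers("Aborted", triggers))  # ['cleanup_job']
```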

What is the difference between Server and Parallel jobs?

Server jobs don’t support partitioning or parallelism and run only on SMP machines, so their performance is low compared to parallel jobs. Parallel jobs support partitioning and parallelism and can run on SMP, MPP, or clustered machines, hence their performance is high.

What does NLS mean in DataStage?

NLS stands for National Language Support in DataStage. It means the IBM DataStage tool can be used with various languages, including multi-byte character languages such as Chinese or Japanese. You can read, write, and process data in any supported language as the requirement demands.

What is the difference between OLAP and OLTP?

OLTP systems contain normalized data whereas OLAP systems contain denormalized data. OLTP stores current data, whereas OLAP contains current as well as historical data for analysis purposes.

Query retrieval is faster in OLTP as compared to OLAP, since OLTP handles small transactional queries rather than large analytical ones.

What are routines in Datastage?

Routines are basically sets of functions defined in the DataStage Manager. They are invoked from a Transformer stage.

There are 3 types of routines

  • Parallel routines
  • Mainframe routines
  • Server routines

What is OSH?

Orchestrate Shell (OSH) is a program, patterned on a UNIX shell, that implements a text-based command language (also called OSH) and controls the parallel processes that implement a job. The osh command is the main program of the InfoSphere parallel engine; DataStage uses it for several different tasks, including parallel job execution and dataset management.

To run this command, three environment variables must be set:
  1. APT_ORCHHOME should point to the Parallel Engine location.
  2. APT_CONFIG_FILE should point to a configuration file.
  3. LD_LIBRARY_PATH should include the path to the parallel engine libraries.
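As a sketch, the environment for an osh invocation could be assembled like this (the install paths below are illustrative; the real layout depends on the installation):

```python
import os

# Illustrative parallel engine install root; real paths vary by installation.
PXHOME = "/opt/IBM/InformationServer/Server/PXEngine"

env = dict(os.environ)
env["APT_ORCHHOME"] = PXHOME
env["APT_CONFIG_FILE"] = "/opt/IBM/InformationServer/Server/Configurations/default.apt"
env["LD_LIBRARY_PATH"] = PXHOME + "/lib:" + env.get("LD_LIBRARY_PATH", "")

# With these three set, the engine could then be invoked, e.g.:
#   subprocess.run(["osh", "-f", "job.osh"], env=env)
print(env["APT_ORCHHOME"])
```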

What is a configuration file in Datastage?

The DataStage configuration file is a master control file (a text file that sits on the server side) for jobs, describing the parallel system resources and architecture. The configuration file provides the hardware configuration for supporting architectures such as SMP (a single machine with multiple CPUs, shared memory and disk), Grid, Cluster, or MPP (multiple nodes, each with its own CPUs and dedicated memory). DataStage understands the architecture of the system through this file.

Configuration files have the extension ‘.apt’. The main benefit of the configuration file is that it separates the software and hardware configuration from the job design: hardware and software resources can change without changing a job design. DataStage jobs can point to different configuration files through job parameters, which means a job can utilize different hardware architectures without being recompiled.

APT_CONFIG_FILE is the environment variable through which DataStage determines which configuration file to use.

Define Node in a configuration file.

A node is a logical processing unit. Each node in a configuration file is distinguished by a virtual name and defines the number and speed of its CPUs, memory availability, page and swap space, and network connectivity details.

What are the different options a logical node can have in the configuration file?

  1. fastname – The physical node name that stages use to open connections for high-volume data transfers. Its value is typically the machine’s network name, which you can obtain with the Unix command ‘uname -n’.
  2. pools – The names of the pools to which the node is assigned. Based on the characteristics of the processing nodes, you can group nodes into sets of pools. A pool can be associated with many nodes, and a node can be part of many pools.
  3. resource – Resources assigned to the node, with the syntax resource resource_type "location" [{pools "disk_pool_name"}]. Typical resource_type values are disk and scratchdisk, as in the example below.

Example: configuration file

{
  node "node1"
  {
    fastname "DS1"
    pools ""
    resource disk "C:/IBM/InformationServer/Server/Datasets/Node1" {pools ""}
    resource scratchdisk "C:/IBM/InformationServer/Server/Scratch/Node1" {pools ""}
  }
  node "node2"
  {
    fastname "DS2"
    pools ""
    resource disk "C:/IBM/InformationServer/Server/Datasets/Node2" {pools ""}
    resource scratchdisk "C:/IBM/InformationServer/Server/Scratch/Node2" {pools ""}
  }
}

What is the difference between a Sequential file and a Hash file?

The Hash file is based on a hashing algorithm and can be used with a key value. The Sequential file, on the other hand, has no key column. A Hash file can be used as a reference for a Lookup, while a Sequential file cannot. Due to the presence of the hash key, the Hash file is faster to search than a Sequential file.
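The lookup difference is essentially keyed access versus a linear scan, which can be sketched in Python (the customer data below is a toy example):

```python
# A hash file lookup is keyed, like a dict; a sequential file
# has to be scanned row by row, like a list.

hash_file = {"C001": "Alice", "C002": "Bob"}          # keyed on customer id
sequential_file = [("C001", "Alice"), ("C002", "Bob")]

def hash_lookup(key):
    return hash_file.get(key)          # O(1) average: jump straight to the key

def sequential_lookup(key):
    for k, v in sequential_file:       # O(n): no key column, scan every row
        if k == key:
            return v
    return None

print(hash_lookup("C002"))        # Bob
print(sequential_lookup("C002"))  # Bob
```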