About my Blog

This blog will help the people, who are interested in Learning Teradata basics in deep.. And it will be helpful for Certification and interview perspective also..

By Santhosh.B

Thursday, 22 November 2012

Types of Space's in Teradata


DBC owns all the Disk Space

  • When the system arrives DBC owns all the Disk Space.
  • Each AMP will have one virtual disk, really four physical disks, which that AMP can read and write, but no other AMP can read or write to or from another AMPs virtual disk.
  • Add up all the AMPs disks and you will know how much space DBC originally owns.
  • This space is called PERMANENT Space, or PERM SPACE.
  • DBC will first CREATE a USER or a DATABAS.

Types of space in Teradata

  • There are three types of space in Teradata. They are called PERM Space, SPOOL Space and TEMP Space.
  • Perm Space is for Permanent Tables, Spool Space is used to temporarily build Answer Sets when users run queries, and Temp Space is used to CREATE and Populate Global Temporary tables.
  • In actuality Spool Space and Temp Space is unused PERM Space.
  • Most users don’t get their own PERM space.
  • All users get Spool Space. Without Spool Space the users couldn’t run queries.
  • Simply remember that Perm is for your Tables and Data and that Spool is used as space for Users to run queries.
  • Tables, Join Indexes, Permanent Journals, Hash Indexes, Stored Procedures and User Defined Functions (UDF) require Perm Space.
  • Views, Macros and Triggers don’t require Perm space.

Spool is like a Speed Limit

  • Teradata definitely has its limits and these pertain to Spool space.
  • The Teradata police will abort your query if at any time you go 1 byte over 20 GBs.
  • Think of PERM Space like money, but think of Spool Space like a speed limit.
  • If the database MRKT is assigned 20 GBs of Spool then that is MRKT’s speed limit. Each user can run queries that travel up to 20 GBs. This goes for all users in MRKT.

Spool is designed for two purposes.
  1. Users have a limit so they can’t hog the system resources.
  2. If users make a mistake and run a runaway query the system will abort it after it reaches that users spool limit. 



Monday, 5 November 2012

Join Index


Join index
  • The Join Index joins the two tables together and keeps the result set inside PERM Space.
  • Users don’t ever query the Join Index, but only query the base tables.
  • If a user does join these two tables together the Parsing Engine will be the one to decide if it is faster to build the answer set from the Join Index instead of actually performing the Join on the Base Tables.
  • In a sense the Join Index is the result of joining the two tables together so the Parsing Engine will almost always decide to take the data from the Join Index instead of joining the tables again manually.
  • When the base tables are updated so is the Join Index, thus keeping everything in sync.
  • In Oracle these are called Materialized Views.
  • It's not Pointers, but actual data is stored.
  • can have Non_Unique and Non_secondary indexes. collect statistics on PI ans SI.
Multi - Table Join index:
  • A Multi - table join index is a join index that involves two or more tables.
Ex:- CREATE JOIN INDEX INDEXNAME AS 
SELECT COL1.T1,COL2.T1,
COL1.T2,COL2.T2 
FROM T1 INNER JOIN T2
ON COL1.T1= COL3.T2 
PRIMARY INDEX(COL1.T1);

Single Table Join Index:-
  • A Single-Table Join Index duplicates a single table, but changes the Primary Index.
  • Users will only query the base table, but the Parsing Engine will use the Join Index if it is deemed to be faster.
  • The reason to create a Single-Table Join Index is so joins can be performed faster because no Redistributions or Duplication needs to occur.
Ex:- CREATE JOIN INDEX INDEXNAME 
AS 
SELECT COL1,COL2,COL3 
FROM T1 PRIMARY INDEX(COL2);
  • we have Just duplicated the T1 able, but with a diffrent primary index.(In base table we have PI col1)
  • This usually done for join processing. when we join the table with another table, No data will redistributed or duplicate.

Aggregate Join Index:-

  • An Aggregate Join Index will allow tracking of the Aggregates SUM and COUNT on any table.
  • This is used in conjunction with the Customer_Number and the Extraction of Month and Year from the Order_Date.
  • As the base table changes so does the values in the Join Index to reflect the changes.
Ex: CREATE JOIN INDEX INDEXNAME AS
SELECT OL1,COL2,COL3
EXTRACT(YEAR FROM COL4) AS YR,
COUNT(*),
SUM(COL5)
FROM T1 GROUP BY COL1,COL2,COL3;
  • Only the aggregates of sum and count can be used with an aggregate join index.
  • Extract month, year are allowed.
  • join index is update as base table changes.
  • Count and sum required to be data type float.

Partitioning table and Partition Primary Index

  • Queries die many times because range queries do a full table scan, but the partitioned table never runs out of spool!.
  • In Range Query, it uses the keyword BETWEEN.
  • The BETWEEN keyword in Teradata means find everything in the range BETWEEN this date and this other date.
  • The BETWEEN statement is said to be inclusive. If someone said to me tell me what is BETWEEN the numbers 8 and 10 I would normally say, “The number 9”. 
  • In Teradata land I would be wrong because the BETWEEN statement is inclusive, so it INCLUDES the starting and ending numbers. What is BETWEEN 8 and 10? The numbers 8, 9 and 10!.
  • Partitioned tables work very well on Range Queries using the keyword BETWEEN.

Why we had to perform a Full Table Scan

  • Order_Table spread across the AMPs. Notice that two colors are color coded the January and February dates.
  • Also notice that January and February dates are mixed on every AMP in what is a random order.
  • Because the January Data is on all AMPs and because the January Dates are randomly mixed we have to do Full Table Scan.
  • We had no indexes on Order_Date so it is obvious the PE will command the AMPs to do a Full Table Scan.
  • By Partitioning the table and we can prevent the Full Table Scan.
  • We partition tables so we won’t have to do a Full Table Scan on our Range Queries.

A Partitioned Table

  • Each AMP always sorts its rows by the Row-ID in order to do a Binary Search on Primary Index queries.
  • Partitioned Table will have the AMPs first sort their rows by the Partition.
  • Rows on an AMP don’t change AMPs because the table is partitioned If the table is partitioned then the AMP will sort its rows by the partition.
  • Partitioning doesn't affect distribution. Partitioning only affects how each AMP sorts the rows they get!!!

Add the Partition to the Row-ID for the Row Key

  • A partitioned table will always add two bytes to every row as part of the Row-ID.
  • If a table is partitioned, the partition number is placed in front of the Row-ID for each row.
  • This combination of the Partition number, Row-Hash, and Uniqueness value are now called the ROW KEY.
  • Instead of sorting by the Row-ID we are merely first sorting by the Partition Number. We are really just sorting by the Row Key!
  • If a table is NOT partitioned the Partition Number is merely set to ZERO!

Partition Primary Index:-

  • Suppose we are retrieving some portion of data from table.  
  • To prevent Full table scan, we are using partitioning table.
  • In PPI we are using some keywords. Those are Range_n, Case_n ,No case,  Unknown.
  • In Range_n we can retrieve range data, between dates or between some numbers…etc.
  • In Case_n we can retrieve data by using case operations.
  • In ‘no case’ partition data will be, but the data is, which data not satisfy the case condition.
  • The table can have up to 65,535 partitions.
  • Partitioning never determines which AMP gets a row.

Multi-Level Partitioning

  • You can have up to 15 levels of partitions within partitions.
  • Partitioned Table merely tells each AMP how to sort their rows for the table.
  • So think of Multi-Level partitioning as a table with multiple sort keys. The first partition statement is how the data is sorted first. The second partition statement is the second sort key.
  • Entire purpose of partitioning is to eliminate the Full Table Scan. Instead of reading all rows in a table each AMP merely has to one or more of their partitions.
Ex:- CREATE TABLE ORDER_DATA ( ORDER_NUM INTEGER NOT NULL
,CUST_NUM INTEGER
,ORDER_DATE DATE
,ORDER_TOT DECIMAL(10,2))PRIMARY INDEX(ORDER_NUM)
PARTITION BY (RANGE_N ( ORDER_DATE BETWEEN DATE ‘2012-01-01’ AND DATE 2012-12-31 EACH INTERVAL ‘1’ DAY), CASE_N(ORDER_TOT<1000, ORDER_TOT<5000, ORDER_TOT< 6000, NO CASE, UNKNOWN));

Friday, 2 November 2012

Secondary Index

  • Secondary Indexes provide an alternate path to the data.
  • So far we have learned that every table has one and only one Primary Index and we have learned that the Primary Index is much faster than the Full Table Scan.
  • Secondary Indexes are not as fast as the Primary Index, but they can be pretty fast, and they can be much faster than a Full Table Scan.
  • There can be up to 32 Secondary Indexes on a table.
  • Every Secondary Index creates a Subtable on every AMP designed to point to the real Primary Index Row-ID.
  • There are two types of Secondary Index and they are Unique Secondary Indexes, which are called USIs and Non-Unique Secondary Indexes called NUSIs.
  • An USI is always a Two-AMP operation so it is almost as fast as a Primary Index, but a NUSI is an All-AMP operation, but not a Full Table Scan.
The Secondary Index Subtable
  • If users want another access path to the data the secondary index is the option.
  • As soon as the USI is created with the SQL syntax the next move comes from Teradata creating a Subtable on every AMP. This is true for both the USI and the NUSI.
  • Let’s say for example the DBA created the maximum of 32 secondary indexes on a table. Then there would be 32 Subtables created, each taking up PERM Space.
  • The entire purpose for the Secondary Index Subtable will be to point back to the real row in the base table via the Row-ID.
USI – Always a Two-AMP Operation
  • Secondary Indexes provide an alternate path to the data.
  • Teradata runs extremely well without Secondary Indexes, but since secondary indexes use up space and overhead, they should only be used on “KNOWN QUERIES” or queries that are run over and over again.
  • Every time the Parsing Engine sees the USI column in the WHERE clause it comes up with a plan that involves only two AMPs. USI query is a two-AMP operation.
NUSI Subtable is AMP-Local
  • The NUSI Subtable is always AMP-Local.
  • A NUSI is a Non-Unique Secondary Index, obvious again meaning that the value is Non-Unique and there could be thousands, millions or even billions of duplicates.
  • So the Parsing Engine takes on a different strategy when building the NUSI Subtable. Each row in the Subtable only tracks the Base rows on the same AMP. This is what is meant by AMP-Local.
  • On the above fig you can see that the AMP labeled “A Typical AMP” holds two base rows of the Employee_Table. The First_Name values, which was the column we created the NUSI Index on holds two values on this AMP, which are “Rakish” and “Vu‟. So in this typical AMPs Subtable there will be two rows tracking “Rakish” and “Vu‟. A NUSI Subtable is always created on each AMP, but in each AMPs Subtable are only values local to the base rows for that AMP.
  • USI rows are hashed and NUSI rows are AMP-Local.
  • USI query is always a two-AMP operation.
  • A NUSI query is an All-AMP operation, but not a full table scan.
  • An USI query is much faster than a NUSI! The Parsing Engine will use an USI at a moment’s notice, but it will not always choose to use a NUSI. Sometimes it will choose a Full Table Scan over a NUSI.
  • The Parsing Engine will never choose a Full Table Scan over an USI.


Primary Index


  • Teradata uses the Primary Index of each table to provide a row its destination to the proper AMP.
  • This is why each table in Teradata is required to have a Primary Index.
  • The biggest key to a great Teradata Database Design begins with choosing the correct column to be the Primary Index.
  • The Primary Index value for a row is the only thing that will determine on which AMP a row will reside.
  • The Primary Index determines data distribution and provides the fastest physical path to retrieving data.
  • Primary Index also plays an incredibly important role in how joins are performed.
  • Every table must have one and only one Primary Index. Because Teradata distributes the data based on the Primary Index columns value.
  • The Primary index is the Physical Mechanism used to retrieve and distribute data.
  • The primary index is limited to the number of columns in the primary index. This means that the primary index is comprised totally of all the columns in the primary index.
  • You can have up to 64 multi-column keys comprising your primary index or as little as one column as your primary index.
Two Types of Primary Indexes (UPI or NUPI)
1) Unique Primary Index(UPI).
2) Non- unique primary Index(NUPI).

Unique Primary Index (UPI)
  • A Unique Primary Index means that the values for the selected column must be unique.
  • A Unique Primary Index will always spread the table rows evenly amongst the AMPs.
  • UPI access is always a one-AMP operation.
  • If we are using UPI value in 'where' condition or Join condition, no need of data redistribution, so performance will be very high.
Non-Unique Primary Index (NUPI)
  • Non-Unique Primary Index (NUPI) means that the values for the selected column can be non-unique.
  • Duplicate values can exist.
  • NUPI will almost never spread the table rows evenly.
Multi-Column Primary Indexes
  • A Primary Index can have multiple columns.
  • Still it is consider as only one Primary Index(combining multiple columns together).
  • Teradata allows up to 64 combined columns to make up the one Primary Index required for a table.
Two benefits:
  1. To get better data distribution among the AMPs.
  2. Users often use multiple keys consistently to query.
When do you define the Primary Index?
  •  When the table is created it is given a table name, the columns and their data types are defined and the Primary Index is specified.
  • If no Primary Index is defined the system will define one for you.
  • It will most likely pick the first column and make it a Non-Unique Primary Index (NUPI).
  • It will however check to see if you have a Primary Key defined for referential integrity purposes. If you do it will choose that column(s) and make it a Unique Primary Index (UPI).
  • If you didn’t define a Primary Index or Primary Key then the system will check to see if you defined a Unique Secondary Index (USI) on any column and if you have it will make that column a Unique Primary Index (UPI).




Thursday, 1 November 2012

How Teradata Retrieves data from Database?


Basically Parser Engine(PE) have some planes to retrieve the data from database.
1st) Single AMP operation: This is the fastest way to retrieve the data. In this PE will check for data in single AMP only. It will happen when we are using Primary Index value in Where condition.
2nd) Two AMP operation: This is 2nd fastest way to retrieve the data. in this PE will Check 2 amps for retrieving the data. It will Happen when we are using Secondary Index value in where condition.
3rd) ALL AMP operation but not full table scan, in this PE will check particular RANGE or CASE of data. It will happen when we are using Partition's based on primary index values.
4th) ALL AMP operation and full table scan. It will happen when we are not using where condition or in where condition if you used Non-indexed column. at that time PE will go for Full table scan.

Retrieving a row by way of the Primary Index: 


Think we have a table EMP, in that table we have columns like Enum,Ename,Sal...in that 'Enum' is Primary index.
 ENUM
ENAME 
 SAL
1001 
 XXXXXX
35000 
 1002
 YYYYYY
32000 

Now we will retrieving data from EMP table:
SELECT ENUM,SAL FROM EMP WHERE ENUM=1001;
in this case:

  • PE will take that primary index value and apply hash formula on that value.
  • By using row hash value, That will find the 'Hash bucket' value.
  • In that 'Hash bucket', we will get 'Hash amp' value.
  • So PE will go directly to that AMP, and retrieve the data from that table.
In this,why i am telling Single AMP operation means, by using '1001' 'row hash' value, PE will go directly where row is located.

Suppose in that Same query we don't have any where conditions, PE will go for ALL AMP operation..
SELECT * FROM EMP;

In this case PE will check every AMP for that data..

Remaining 2nd,3rd,4th we will discuss coming post's.. Just remember those points.



Thanks for visiting my blog.. if you have any doubt's please let me know.. Give me your valuable comments..
Thanks in advance.
By Santhosh.B

How Teradata Distribute the data?

  • The Primary Index is the column(s) that lays out the data row to the proper AMP.
  • The Primary Index column(s) is also the fastest way to retrieve a row from that same AMP(this one we will discuss in next post).
  • Teradata takes a table and spreads the rows across the AMPs one row at a time.
  • A Unique Primary Index on the table will spread the data rows perfectly evenly across the AMPs.
  • Teradata knows exactly which rows went to which AMPs so retrieval is always a 1-AMP operation when users use the Primary Index in the WHERE Clause of their SQL. Here is how that works.


Hashing the primary Index Value and Placing the row's.

  • The Teradata Parsing Engine will take the Primary index value of a row and run a math calculation called the hash formula on that primary index column value.
  • This hash formula does't change and can be calculated on any value or datatype.
  • The result of hash formula will result in a number ranging from one to one million.
  • The Teradata hashMAP with one million buckets, those buckets contain AMP number's. For example :- we have 4 AMP's in our TD system, so the million buckets contain numbers like 1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4,........ million buckets. 1st bucket contain 1st amp number, 2nd bucket contain 2nd amp num, 3rd bucket contain 3rd amp num, 4th bucket contain 4th amp number, and 5th bucket contain 1st amp number, 6th bucket contain 2nd amp num.... like this upto million.
  • 1st PE will generate row hash value of primary index value. that is 32 bit value, Is called Row Hash value.
  • In that 32 bit value,1st 16 digits are find the Hash bucket value.(for example: think 1st 16 bit's of Row hash value is gave value like '20', so PE will go to that 20th number bucket, and that will take amp number)
  • in that bucket we will get AMP number.
  • based on this , the row will lays in AMP's.
  • for example: We have one table in that table we have two columns number, name.. in this 'number' is primary index. now we are inserting one row in our table.
  • INSERT INTO TABLENAME VALUES(101,'SANTHOSH').. in this primary index value is 101, so 101 value will go to hashmap for math calculations, and that will give 32 bit value of 101. in that 32 bit value 1st 16bit's will give one number for example take '34', so our bucket number is 34, in that bucket, we have amp value like '2'... now our row will goto that 2nd amp..


Thanks for visiting my Blog.. and if i missed any thing please let me know.. or if you have any doubt's please give me your valuable comment's.. i will give you proper answer.. 

Thanking you all..

Tuesday, 30 October 2012

Teradata Architecture


Basically Teradata architecture have three components..
those are:
1) PE(Parser Engine)
2)BYNET
3) AMP(Access Module Processor).


PE-parser Engine
  • Heart of the Teradata.
  • The Parsing Engines are perfectly balanced, with each having the capability to handle up to 120 users at a time.
  • This could be 120 distinct users or a single user utilizing the power of all 120 sessions for a single application.
  • That is why there are multiple PE’s in every Teradata system.
  • Each PE has total command over every AMP.
  • Each PE will take users SQL and do three things:
1.Syntax check - check the users SQL syntax.
2.Security Check - check the users ACCESS RIGHTS.
3.Plan - PLAN to satisfy the user request.
  • The fastest plan is a Single-AMP retrieve.
  • The second fastest plan is a Two-AMP retrieve.
  • The next fastest plan will be all AMPs reading only a portion of the table, and The slowest plan is the full table scan. That is where each AMP reads every row they contain for a table.

AMP(Access Module Processor)

  • Each PE rules them all because the rows of every table are spread across all the AMPs.
  •  AMPs organize every table in separate blocks.
  • PE passes the PLAN to the AMPs over the BYNET.
  •  When a table is first created each AMP creates a table header on their disk.
  •  When the table is loaded each AMP receives rows for that table that they and only they own.
  • They carefully place the rows inside data blocks where they can easily be retrieved.

BYNET

  • The PE comes up with a PLAN and passes the plan to the AMPs in steps over the BYNET.
  • AMPs then retrieve the data requested by the PE and they deliver their portion of the answer set to the PE over the BYNET.
  • BYNET provides the communications between AMPs and Pes.
  • There are always two BYNETs for redundancy and extra bandwidth. AMPs and PEs can use both BYNETs to send and retrieve data simultaneously.


Database and Logical Modeling?


 Database:
                A database is collection of permanently stored data used by an application or enterprise. 
A database contains logically related data, which means that the database was created purpose of mind.  
A database supports shared access by many users.
·Protected access to data is controlled.
·Managed data has integrity and value.
·Based on the relational model.
Logical Modeling:
Tables are logically created for all database systems.
The logical model should be independent of usage. A variety of front end tools can be accommodated simultaneously so that the database can be created more quickly. Teradata supports normalized logical models, because we are able to perform 64 table joins and we are to perform large aggregations during queries.
A key Teradata strength is our ability to model the business of customers. Teradata Business models are truly normalized avoiding the costly star schema, snowflake. Teradata can do star schema and other types of relational modeling, but 3NF is recommended.

Teradata Advantages

·Automatic, Even data distribution.
·High scalability.
·Mature optimizer (complex queries, 64 joins, ad-hoc processing).
·Model the business: 3NF, Star schema..etc.
·Lowest TCO (Total Cost of Ownership):  Easy to install, Easy to work, Easy to manage and robust utilities.
·Acts like Single DataStore.

  • Many bulk load utilities: BTEQ, FASTLOAD, MULTILOAD, TPUMP, FAST EXPORT…

why Teradata?

·When compare other RDBMS it is good in performance wise, because it is shared nothing architecture.
·It can store billions of rows.
·Unconditional parallelism.
·Using Indexes for better storing and fast retrieval.
·Supports easy scalability from small (10GB) to a massive database (100+TB).
·System to grow to support more users/data/queries/complexity of queries without experiencing performance degradation.
·Provides a parallel aware optimizer that makes query tuning unnecessary to get a query to run.
·And Optimizer determines the least expensive plan (time-wise) to process queries fast and in parallel.
·Automatic and even distribution avoiding complex indexing schemes or time consuming
reorganizations.Single operational view of the entire MPP (massively parallel processing) system and single point of control for the DBA (TD manager).

Monday, 29 October 2012

What is Teradata(TD)?

·TD is a RDBMS in DW Environment.
·TD is an open system means that it is platform independent and satisfies industry standards.
·Compatible with industry ANSI standards.
·And it is currently available for the UNIX and WINDOWS operating system.
In this, for Teradata 13.0 version will support only windows7 32bit only. If you want to install in windows 7 64bit you should use VMvare.
And TD13.0 version is not supporting for windows XP. TD 12.0 version will install in XP o.s.

Check bellow table once.

Windows 7 64bit
Windows 7 32bit
Windows XP
TD 13.0
Use  VMvare
It will support
Use VMvare
TD 12.0
VMvare
VMvare
It will support
·It will run on single or multiple nodes or severs.
·And it can act’s like a server.
·Built in parallelism (for this I will give you deep explanation in coming posts).
·Client platforms access the database through TCP-IP connection or across an IBM mainframe channel connection.
·Large database server.
·The Teradata Database was the first commercial database system to support a trillion bytes of data. 10^12= 1,000,000,000,000 (Trillion) bytes.
·Built on a parallel architecture.