Satyaki De - SQL Exploration: regexp

Regular Expression on Teradata 14.0

Posted by Satyaki De

Sunday, February 23, 2014

I've been working for more than 8 years in Oracle 10g, 11g & worked significant queries on Regular expressions in various scenario using SQL. It is real handy if you know how to use it & can reduce lots of pain with single SQL. And, the performance will be better compared to the total effort to achieve the same functionalists by using multiple SQL queries or PL/SQL Procedures.

Last couple of years, I'm working on Teradata. And, on some occasion - I was expecting features like these, where I can easily manipulate data with regular expression. I'm pretty excited when I heard that Teradata also introduced Regular Expression from Version 14.0.

As a result, I tried all those features that I think can be handy & useful for various scenarios & followings are the successful queries that I get. There are two occasion, where Teradata partially able to manipulate those strings. I've checked the latest Teradata Manual. However, unable to find those solution. So, I'm expecting other forum members can contribute here in order to make this thread useful for every one of us. And, I'll post here as soon as I get some answers on these partial conversions.

For better understanding, I've provided the actual column value & after transformation value of that column in the output. That will help us to grasp it easily - I guess. :)

Case 1,

SELECT regexp_replace('SatyakiDe','([[:lower:]]{1,})([[:upper:]]{1,})','\1 \2') AS COL_VAL;
 
COLA             COL_VAL
---------------- ----------------------------------------
SatyakiDe        Satyaki De

Case 2,

select regexp_replace('919047242526','^([[:digit:]]{2})([[:digit:]]{10})','+\1 \2') COL_VAL;
 
COLA         COL_VAL
------------ ---------------
919047255555 +91 9047255555

Case 3,

select regexp_replace('+++C','^([[:punct:]]{2})([[:punct:]]{1})(.*)$','\1\3') COL_VAL;
 
COLA COL_VAL
---- -----
+++C ++C

Case 4,

select initcap(regexp_replace(regexp_substr(' satyaki.de@mail.com','[^@]+'),'(.*)(\.)(.*)','\1 \3')) COL_VAL;
 
COLA                             COL_VAL
-------------------------------- --------------------------------------------------
satyaki.de@mail.com              Satyaki De

Case 5,

select regexp_replace('100011001','([[:digit:]]{3})([[:digit:]]{2})([[:digit:]]{4})','XXX-XX-\3') as COL_VAL;
 
COLA             COL_VAL
---------------- --------------------
100011001        XXX-XX-1001

Case 6,

select regexp_replace('123456789','([[:digit:]]{3})([[:digit:]]{3})([[:digit:]]{3})','\3.\2.\1') as COL_VAL;
 
COLA      COL_VAL
--------- ---------------
123456789 789.456.123

Case 7,

SELECT regexp_replace('satyaki9de0loves3to8work2on2sql0and2bi6tools1','[^0-9]+','',1,0,'i') AS DER_VAL;

COLA                                            DER_VAL
---------------------------------------------   ----------
satyaki1de0loves3to8work2on2sql0and2bi4tools1   1038220241

As you can see, all the characters have filtered out from the string & only numbers are kept here. These sorts of queries are very useful in lots of different business scenarios as well.

So, any extra space may not produce desired result. And, needs to pay attention into these small details.

And, I've tested all these queries in the following two versions -

select * from dbcinfo;
 
    InfoKey  InfoData
    -------- ------------------------
1   VERSION  14.10.00.02
2   RELEASE  14.10.00.02
3   LANGUAGE SUPPORT MODE   Standard
 
 
select * from dbcinfo; 
 
    InfoKey  InfoData
    -------- ------------------------
1   VERSION  14.10.01.05
2   RELEASE  14.10.01.04
3   LANGUAGE SUPPORT MODE   Standard

Hope, this will give you much more clarity. :)

One more thing, I would like to clarify here - my intention is to describe more features about these regexp_(similar/substr/instr/replace) functions.

I've received one question whether these regexp functions available in TD 13 or not in Teradata forum while posting the same article over there.

And, here is my answer to that question -

Regarding version 13,

Let us check whether they have these regexp functions or not -

select * from dbcinfo;
 
    InfoKey  InfoData
    -------- ------------------------
1   VERSION  13.00.00.15
2   RELEASE  13.00.00.15
3   LANGUAGE SUPPORT MODE   Standard
 
 
select * from dbcinfo; 
 
    InfoKey  InfoData
    -------- ------------------------
1   VERSION  13.10.07.12
2   RELEASE  13.10.07.12
3   LANGUAGE SUPPORT MODE   Standard

select regexp_replace('SatyakiDe','^(.*)([[:upper:]]{1,})(.*) $','\1 \2\3') AS COL_VAL;
 
select regexp_replace('SatyakiDe','^(.*)([[:upper:]]{1,})(.*) $','\1 \2\3') AS COL_VAL;
 
select regexp_replace('SatyakiDe','^(.*)([[:upper:]]{1,})(.*) $','\1 \2\3') AS COL_VAL;
                                   $
  *** Failure 3706 Syntax error:  expected something between '(' and the string 'S' keyword.
                      Statement# 1, Info =35
 *** Total elapsed time was 1 second.

Hope this will give adequate clarity to the answer of that above question.

Now, Lets see some other functionality.

REGEXP_SIMILAR has similar functionality like REGEXP_LIKE in Oracle.

Let's see couple of such cases -

Lets prepare the table with some dummy data -

SELECT * FROM dbc.dbcinfo;
 
    InfoKey  InfoData
    -------- -----------------------
1   VERSION  14.10.01.05
2   RELEASE  14.10.01.04
3   LANGUAGE SUPPORT MODE   Standard
 
 
CREATE MULTISET VOLATILE TABLE TEST_T1
  (
            COL1  VARCHAR(10)
  ) 
ON COMMIT 
PRESERVE ROWS;
 
  INSERT INTO TEST_T1 VALUES('456')
  ;INSERT INTO TEST_T1 VALUES('123x')
  ;INSERT INTO TEST_T1 VALUES('x123')
  ;INSERT INTO TEST_T1 VALUES('y')
  ;INSERT INTO TEST_T1 VALUES('+789')
  ;INSERT INTO TEST_T1 VALUES('-789')
  ;INSERT INTO TEST_T1 VALUES('159-')
  ;INSERT INTO TEST_T1 VALUES('-1-');

Lets check the data now -

SELECT *
FROM TEST_T1;
 
    COL1
1   123x
2   456
3   x123
4   +789
5   -789
6   y
7   159-
8   -1-

Let's look into the various scenarios now -

Case 1 (Returns Mixed Numbers, Signed Numbers & Non Numbers),

SELECT *
 FROM TEST_T1
 WHERE REGEXP_SIMILAR(COL1,'^[0-9]+$','c')=0;
 
    COL1
    -----
1   123x
2   x123
3   +789
4   -789
5   y
6   159-
7   -1-

Case 2 (Returns Only Unsigned Positive Numbers),

SELECT *
 FROM TEST_T1
 WHERE REGEXP_SIMILAR(COL1,'^[0-9]+$','c')=1;
 
COL1
-----
456

Case 3 (Returns All Numbers including Positive, Negative & unsigned),

SELECT *
FROM TEST_T1
WHERE REGEXP_SIMILAR(COL1,'^[+-]?[0-9]+[+-]?$','c')=1;
 
COL1
-----
456
+789
-789
159-
-1-

Case 4 (Returns Only Non Numbers i.e. Characters),

SELECT *
FROM TEST_T1
WHERE REGEXP_SIMILAR(COL1,'[^0-9]+','c')=1;

COL1
----
y

Hope this will give you some additional idea. :)

My objective is to provide basic information to my friends. So, that they can write better SQL in TD while migrating from other popular databases or new developer in TD can get a flavor of this powerful feature & exploit them in all the positive aspect & apply them properly. :D

Really appreciate your time to read this post.

Regards.

Satyaki De.

Labels: 14, analytic function, code, data warehouse, pattern matching, regexp_replace, regexp_similar, regexp_substr, regular expression, snippet, sql, Teradata

Regular Expression In Oracle

Posted by Satyaki De

Sunday, January 3, 2010

comments (2)

From 10g in Oracle there is a significant power is given to all new Oracle Programmer and that is inherited from POSIX. Yes, you guess right - it is regular expression. One of the most powerful features that is missing for quite some time and programmer no need to write big and complex functions or procedures to produce their required result.

In this post i'm not going to discuss about the syntax and all the minute details which require to use it. Rather, i'll concentrate on some useful snippet on regular expression which might be helpful for you in your case or may require little enhancement from your part.

As i follow OTN, and find these requirements are frequently needed by many programmer. So, why not compiled them in a single platform. If you want to know the basic syntax of regular expression then you can visit this place.

One thing i would like to tell to produce the formatted output - i use CAST function here. Otherwise, it is not required in most of the case where i've used in this post. Hope you don't get confused for this.

Splitting comma-separated values:

scott@ORCL>select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.1.0.6.0 - Production
PL/SQL Release 11.1.0.6.0 - Production
CORE    11.1.0.6.0      Production
TNS for 32-bit Windows: Version 11.1.0.6.0 - Production
NLSRTL Version 11.1.0.6.0 - Production

Elapsed: 00:00:00.00
scott@ORCL>
scott@ORCL>
scott@ORCL>with tt
  2  as
  3    (
  4      select 1 id, 'saty,anup,john,sas' src from dual
  5      union all
  6      select 2, 'shank,rajib,papu,sanjit,amit' from dual
  7    )
  8  select k.*
  9  from (
 10             select distinct id,
 11                    regexp_substr(src,'[^,]+',1,level) cooked_src
 12             from tt
 13             connect by level <= (length(src) - length(replace(src,',',''))) + 1
 14      ) k
 15  order by k.id,k.cooked_src;

        ID COOKED_SRC
---------- ----------------------------
         1 anup
         1 john
         1 sas
         1 saty
         2 amit
         2 papu
         2 rajib
         2 sanjit
         2 shank

9 rows selected.

Elapsed: 00:00:00.38
scott@ORCL>
scott@ORCL>

Insert a space inside a joined word:

scott@ORCL>with res
  2  as
  3    (
  4      select 'SatyakiDe' cola from dual
  5      union all
  6      select 'RajibChakroborty' from dual
  7      union all
  8      select 'PranabPaul' from dual
  9    )
 10  select cola,
 11         cast(regexp_replace(cola,'^(.*)([[:upper:]]{1,})(.*)$','\1 \2\3') as varchar2(40)) cooked_cola
 12  from res;

COLA             COOKED_COLA
---------------- ----------------------------------------
SatyakiDe        Satyaki De
RajibChakroborty Rajib Chakroborty
PranabPaul       Pranab Paul

Elapsed: 00:00:00.04
scott@ORCL>

Insert a space inside a joined word:

scott@ORCL>
scott@ORCL>with fer
  2  as
  3    (
  4      select '919047242526' cola from dual
  5      union all
  6      select '919038220261' from dual
  7    )
  8  select cola,
  9         cast(regexp_replace(cola,'^([[:digit:]]{2})([[:digit:]]{10})','+\1 \2') as varchar2(15)) cooked_cola
 10  from fer;

COLA         COOKED_COLA
------------ ---------------
919047242526 +91 9047242526
919038220261 +91 9038220261

Elapsed: 00:00:00.03
scott@ORCL>

Remove one + from the source:

scott@ORCL>with hig
  2  as
  3    (
  4      select '+++C' cola from dual
  5    )
  6  select cola,
  7         cast(regexp_replace(cola,'^([[:punct:]]{2})([[:punct:]]{1})(.*)$','\1\3') as varchar2(5)) cooked_cola
  8  from hig;

COLA COOKE
---- -----
+++C ++C

Elapsed: 00:00:00.02
scott@ORCL>

Extracting number from string:

scott@ORCL>with tep
  2  as
  3    (
  4      select 'satyaki9de0is3a8studious2and2energetic0software2engineer6here1' cola from dual
  5    )
  6  select cola,
  7         cast(regexp_replace(cola,'[^0-9]','') as varchar2(12)) cooked_cola
  8  from tep;

COLA                                                           COOKED_COLA
-------------------------------------------------------------- ------------
satyaki9de0is3a8studious2and2energetic0software2engineer6here1 9038220261

Elapsed: 00:00:00.03
scott@ORCL>

Extracting names from mails:

scott@ORCL>
scott@ORCL>with reg
  2  as
  3    (
  4      select 'satyaki.de@gmail.com' cola from dual
  5      union all
  6      select 'pranab.paul@aol.in' from dual
  7      union all
  8      select 'tuhin.chakroborty@rediffmail.com' from dual
  9      union all
 10      select 'debraj.saha@yahoo.com' from dual
 11    )
 12  select cola,
 13         cast(
 14               initcap(regexp_replace(regexp_substr(cola,'[^@]+'),'(.*)(\.)(.*)','\1 \3')) as varchar2(50)
 15             ) cooked_cola
 16  from reg;

COLA                             COOKED_COLA
-------------------------------- --------------------------------------------------
satyaki.de@gmail.com             Satyaki De
pranab.paul@aol.in               Pranab Paul
tuhin.chakroborty@rediffmail.com Tuhin Chakroborty
debraj.saha@yahoo.com            Debraj Saha

Elapsed: 00:00:00.03
scott@ORCL>

Insert spaces between small & caps letter:

scott@ORCL>
scott@ORCL>with kit
  2  as
  3    (
  4      select 'AbraCaDabra' cola from dual
  5    )
  6  select cola,
  7         cast(
  8                regexp_replace(cola,'([[:lower:]])([[:upper:]])','\1 \2') as varchar2(20)
  9             ) cooked_cola
 10  from kit;

COLA        COOKED_COLA
----------- --------------------
AbraCaDabra Abra Ca Dabra

Elapsed: 00:00:00.02
scott@ORCL>

Masking credit card information:

scott@ORCL>with jol
  2  as
  3    (
  4       select '4567098723560984' cola from dual
  5    )
  6  select cola,
  7         cast(
  8               regexp_replace(cola,'([[:digit:]]{4})([[:digit:]]{4})([[:digit:]]{4})([[:digit:]]{4})','\1-XXXX-XXXX-\4')
  9               as varchar2(20)
 10             ) cooked_cola
 11  from jol;

COLA             COOKED_COLA
---------------- --------------------
4567098723560984 4567-XXXX-XXXX-0984

Elapsed: 00:00:00.02
scott@ORCL>

Convert a number series to ip address & also print this in reverse order:

scott@ORCL>
scott@ORCL>with rev
  2  as
  3    (
  4      select '123456789' cola from dual
  5    )
  6  select cola,
  7         cast(
  8               regexp_replace(cola,'([[:digit:]]{3})([[:digit:]]{3})([[:digit:]]{3})','\3.\2.\1')
  9               as varchar2(15)
 10             ) cooked_cola
 11  from rev;

COLA      COOKED_COLA
--------- ---------------
123456789 789.456.123

Elapsed: 00:00:00.02
scott@ORCL>

Hope you like this.

Very soon i'm going to post another useful snippet on some other topic. Till then best of luck and keep following this blog.

Labels: OTN, regexp_replace, regexp_substr, regular expression, snippet

Satyaki De - SQL Exploration

Oracle Certified

Me

About Me

Readers

Visitors

Search This Blog

Blog Archive

Labels

Regular Expression on Teradata 14.0

Regular Expression In Oracle