Oracle Partitioned Sequences – a future new feature in 12c?

Disclaimer: I’ll talk about an undocumented feature that appeared in Oracle 12c – undocumented except if you consider that being exposed in DBA_SEQUENCES as the PARTITION_COUNT is a kind of documentation. So, as the syntax is not documented (I got it only by guessing), you should not use it in production: there are probably some reasons why Oracle did not expose an interresting feature like that. Remark: Sorry for the formatting of this post. I tried to use nice tables instead of monospace code, but that was not a success.

In these days, you probably have most of primary keys coming from a sequence: a generated number that is always increasing in order to be sure to have no duplicates. And of course you have a (unique) index on it. But what do you see when you have a high activity inserting rows concurrently to that table ?

Here is what you can see:

That came from a retail database where we did a load test simulating the order entry rate that was expected for production. The AWR report shows the segments concerned:

No doubt: index hot block contention on primary key.

It is a well know issue. Because the index is sorted, the value coming from the sequence is always going to the last block. All sessions have to access to the same block and there is a ‘buffer busy wait’ contention on it. And when you are in RAC that block has to be updated by the different nodes and it’s worse, showing all those ‘gc’ wait events.

In order to illustrate the different solution, I have created a DEMO table with 1 million rows. And then I insert 100000 additional rows and check how many blocks where touched:

create table DEMO_TABLE (id,n , constraint DEMO_PK primary key(id) ) 
 as select rownum,rownum from 
  (select * from dual connect by level

And here is the number of index blocks touched:

select object_name,data_object_id,sum(tch) from sys.x$bh 
 join user_objects on user_objects.data_object_id=x$bh.obj 
 where object_name like 'DEMO%' 
 group by object_name,data_object_id order by 1 desc;

OBJECT_NAME	DATA_OBJECT_ID	SUM(TCH)
DEMO_TABLE	97346	243
DEMO_PK	97347	243

And I’m interrested about index block split as well:

select name,value from v$mystat 
 join v$statname using(statistic#) 
 where name like 'leaf node%splits';

NAME	VALUE
leaf node splits	200
leaf node 90-10 splits	200

I touched 200 blocks only (the size of the index entry is approximatively the same as the table row size). The highest block is filled with an increasing value and only when it is full the insert allocates a new block and goes to it. This is optimal for one session, but it is the cause of contention on multi-threaded inserts because all sessions are touching the same block at the same time.

So what’s the solution ? Of course, you want to have your primary key value distributed. Reverse index ? Let’s see:

alter index DEMO_PK rebuild reverse;
Index altered.

Then I run the same inserts and here are the statistics about index blocks:

OBJECT_NAME	DATA_OBJECT_ID	SUM(TCH)
DEMO_TABLE	97349	247
DEMO_PK	97352	4392

NAME	VALUE
leaf node splits	182
leaf node 90-10 splits	0

Now I touched 4000 blocks (without any block split because each new value fit in the 5% pctfree after my index rebuild). Great: no contention. I mean… no buffer contention.

But think about it. Because of my reverse distribution, I touch now all the index blocks. Do they fit in my buffer cache anymore ? Probably not. And one day, when I have more data, I encounter i/o issues.

If you want to see an illustration of the contentions I am talking about here, you can check the slides 14 to 19 from the Oracle Real Performance Tour where Tom Kyte, Andrew Holdsworth & Graham Wood have shown a nice demo of that. So we want to spread the values on several index blocks, but not on all index blocks. Hash partitioning can be used for that. Let’s have 4 partitions:

alter table DEMO_TABLE disable constraint DEMO_PK;
create index DEMO_PK on DEMO_TABLE(id) global partition by hash(id) partitions 4;
alter table DEMO_TABLE enable constraint DEMO_PK;

and the result is quite good:

OBJECT_NAME	DATA_OBJECT_ID	SUM(TCH)
DEMO_TABLE	97353	245
DEMO_PK	97357	76
DEMO_PK	97358	76
DEMO_PK	97359	76
DEMO_PK	97360	76

NAME	VALUE
leaf node splits	213
leaf node 90-10 splits	213

I’ve distributed my inserts over 4 partitions, having 4 hot blocks instead of one. This is a way to prevent buffer busy waits when having a few concurrent sessions inserting new values.

But the distribution is done on the value coming from the sequence. So each session will touch sequentially each of the 4 hot blocks. Even if this reduces the probablility of contention, it is not optimal. And if you’re going in RAC you will see those ‘gc’ wait events again, with the hot blocks being accessed by all nodes.

The actual solution to the problem is not to distribute the insert based on the value, but having the distribution key based on the session identification. If each session has its own index block to insert into, then all contention is gone.

This is exactly what will be addressed by the ‘partitioned sequence’ feature that is not (yet) documented.

 It generates sequence values in different ranges of value. And that range depends on the session (probably a hash function on the instance number and the session id).
I come back to my non-partitioned no-reverse index, and alter the sequence as:

alter sequence DEMO_SEQUENCE partition 4;

And here is the stats about index blocks:

OBJECT_NAME	DATA_OBJECT_ID	SUM(TCH)
DEMO_TABLE	97361	397
DEMO_PK	97364	404

NAME	VALUE
leaf node splits	351
leaf node 90-10 splits	351

First, you see that my table is larger. This is because the sequence number from the partitioned sequence is larger. It is build by prefixing the sequence number with a partition value, and that makes the binary representation larger. The table here have only two columns, but on a real table, the difference will not be so important. The index is bigger, that’s a fact. However if you compare it with the reverse index (that has a lot of free space in the blocks) it is much better here. And you can reduce the sequence max value if you want a smaller id. But the very good thing is that the instances and sessions will work on different ranges, avoiding block contention, while keeping the index maintenance optimal in buffer cache.

Here is my sequence:

select sequence_name,min_value,max_value,last_number,partition_count from user_sequences;

SEQUENCE_NAME	MIN_VALUE	MAX_VALUE	LAST_NUMBER	PARTITION_COUNT
DEMO_SEQUENCE	1	9999999999999999999999999999	10100000	4

Ok, that feature is really nice, but you can’t use it until it is documented in a future release. So you have to do it yourself: concatenate a value hased from the instance/session in front of the number coming from the sequence. Of course, at the Real Performance Tour the undocumented solution was not raised, and the solution presented was prefixing the sequence. Here is just as an example:

insert into DEMO_TABLE 
 select 1e30*mod(to_number(sys_context('userenv','sid')),4)+DEMO_SEQUENCE.nextval,...

and here are the index statistics:

OBJECT_NAME	DATA_OBJECT_ID	SUM(TCH)
DEMO_TABLE	97365	351
DEMO_PK	97366	351

NAME	VALUE
leaf node splits	433
leaf node 90-10 splits	371

So while dreaming about a feature that you may be able to use in the future, you can acheive the same goal if you’re ready to change your code. Anyway, achieving scalability, and good performance on high load often requires to touch the code a bit.

update:

If you want to see that feature documented, maybe voting up here can help.

4 Comments

Jonathan Lewis says:

July 26, 2017 at 18 h 49 min

Franck,

Looks like things have changed in 12.2 – part_count is still there in seq$ but the hidden parameters have changed, the “partition N” syntax doesn’t work at present, and the columns in dba_sequences have changed – preliminary note in a comment from a reader: http://jonathanlewis.wordpress.com/2015/06/17/reverse-key-2/#comment-9804

Reply to Jonathan

Franck Pachot says:

July 28, 2017 at 16 h 25 min

Hi Jonathan, thanks for the heads up. Yes it seems that syntax is simpler:

SQL> create sequence DEMO_SEQ minvalue 1 maxvalue 1000 scale;
Sequence DEMO_SEQ created.

SQL> select * from user_sequences where sequence_name='DEMO_SEQ';
SEQUENCE_NAME  MIN_VALUE  MAX_VALUE  INCREMENT_BY  CYCLE_FLAG  ORDER_FLAG  CACHE_SIZE  LAST_NUMBER  SCALE_FLAG  EXTEND_FLAG  SESSION_FLAG  KEEP_VALUE
-------------  ---------  ---------  ------------  ----------  ----------  ----------  -----------  ----------  -----------  ------------  ----------
DEMO_SEQ       1          1000       1             N           N           20          1            Y           N            N             N

SQL> select DEMO_SEQ.NEXTVAL from DUAL;

Error starting at line : 7 File @ /media/sf_share/122/blogs/scalable-sequences.sql
In command -
select DEMO_SEQ.NEXTVAL from DUAL
Error report -
ORA-64603: NEXTVAL cannot be instantiated for DEMO_SEQ. Widen the sequence by 3 digits or alter sequence with SCALE EXTEND.

SQL> select ksppinm,ksppdesc,ksppstvl from x$ksppi join x$ksppcv using(indx) where ksppinm like '%_kqdsn%';

KSPPINM                 KSPPDESC                                        KSPPSTVL
-------                 --------                                        --------
_kqdsn_instance_digits  number of instance digits in scalable sequence  2
_kqdsn_cpu_digits       number of cpu digits in scalable sequence       3

SQL> alter sequence DEMO_SEQ scale extend;
Sequence DEMO_SEQ altered.

SQL> select DEMO_SEQ.NEXTVAL from DUAL;
NEXTVAL
-------
1010170001

SQL> select DEMO_SEQ.NEXTVAL from DUAL;
NEXTVAL
-------
1010170002

SQL> connect / as sysdba
Connected.
SQL> select DEMO_SEQ.NEXTVAL from DUAL;
NEXTVAL
-------
1013920003

Good to see that the idea was not abandoned.

Reply to Franck

anbob says:

August 4, 2017 at 8 h 53 min

hi Franck,

101 017 0001 0001 was sequence value 017 was current sid what’s 101 mean?

Reply to anbob
- Franck Pachot says:
  
  August 4, 2017 at 10 h 00 min
  
  Hi Anbob, I’m not sure but it is supposed to have 2 digits for instance number (01) and 3 digits for ‘CPU’ – probably hashed from process id. However, it seems that there is an additional digit.
  
  Reply to Franck

Follow: Linkedin, Twitter, Youtube, Mastodon, dev.to