# Creating data distributions

You can generate statistics for a table and you can build data distributions for each table that your query accesses.

The database server creates data distributions, which provide information to the optimizer, any time the UPDATE STATISTICS MEDIUM or UPDATE STATISTICS HIGH command is executed.

The
database server creates data distributions by sampling a column's
data, sorting the data, building distributions bins, and inserting
the results into the **sysdistrib** system catalog table.

You can control the sample size for the scan through the keyword HIGH or MEDIUM. The difference between UPDATE STATISTICS HIGH and UPDATE STATISTICS MEDIUM is the number of rows sampled. UPDATE STATISTICS HIGH scans the entire table, while UPDATE STATISTICS MEDIUM samples only a subset of rows, based on the confidence and resolution used by the UPDATE STATISTICS statement.

You can use the LOW keyword with the UPDATE STATISTICS statement only for fully qualified index keys.

If a distribution has been generated for a column, the
optimizer uses that information to estimate the number of rows that
match a query against a column. Data distributions in **sysdistrib** supersede
values in the **colmin** and **colmax** column of the **syscolumns** system
catalog table when the optimizer estimates the number of rows returned.

When
you use data-distribution statistics for the first time, try to update
statistics in MEDIUM mode for all your tables and then update statistics
in HIGH mode for all columns that head indexes. This strategy produces
statistical query estimates for the columns that you specify. These
estimates, on average, have a margin of error less than *percent *of
the total number of rows in the table, where *percent* is the
value that you specify in the RESOLUTION clause in the MEDIUM mode.
The default percent value for MEDIUM mode is 2.5 percent. (For columns
with HIGH mode distributions, the default resolution is 0.5 percent.)

With the DISTRIBUTIONS ONLY option, you can execute UPDATE STATISTICS MEDIUM at the table level or for the entire system because the overhead of the extra columns is not large.

The database server uses the storage locations that the DBSPACETEMP environment variable specifies only when you use the HIGH option of UPDATE STATISTICS.

You
can prevent UPDATE STATISTICS operations from using indexes when sorting
rows by setting the third parameter of the DBUPSPACE environment variable
to a value of `1`.

For each table that your query accesses, build data distributions according to the following guidelines. Also see the examples below the guidelines.

To generate statistics on a table:

To build data distributions for each table that your query accesses:

- Run a single UPDATE STATISTICS MEDIUM
for all columns in a table that do not head an index.
Use the default parameters unless the table is very large, in which case you should use a resolution of

`1.0`and confidence of`0.99`. - Run the following UPDATE STATISTICS statement to create distributions
for non-index join columns and non-index filter columns:
`UPDATE STATISTICS MEDIUM DISTRIBUTIONS ONLY;`

- Run UPDATE STATISTICS HIGH for all columns that head an index. For the fastest execution time of the UPDATE STATISTICS statement, you must execute one UPDATE STATISTICS HIGH statement for each column, as shown in the example below this procedure.
- If you have indexes that begin with the same subset of columns, run UPDATE STATISTICS HIGH for the first column in each index that differs, as shown in the second example below this procedure.
- For each single-column or multi-column index on the table:
- Identify the set of all columns that appear in the index.
- Identify the subset that includes all columns that are not the leading column of any index.
- Run UPDATE STATISTICS LOW on each column in that subset. (LOW is the default.)

- For the tables on which indexes were created in Step 3, run an
UPDATE STATISTICS statement to update the
**sysindexes**and**syscolumns**system catalog tables, as shown in the following example:`UPDATE STATISTICS FOR TABLE t1(a,b,e,f);`

- For small tables, run UPDATE STATISTICS HIGH, for example:
`UPDATE STATISTICS HIGH FOR TABLE t2;`

Because the statement constructs the statistics only once for each index, these steps ensure that UPDATE STATISTICS executes rapidly.

## Examples

- Example of UPDATE STATISTICS HIGH statements for all columns that head an index
- Suppose you have a table
**t1**with columns**a**,**b**,**c**,**d**,**e**, and**f**with the following indexes:`CREATE INDEX ix_1 ON t1 (a, b, c, d) ... CREATE INDEX ix_3 ON t1 (f) ...`

Run the following UPDATE STATISTICS statements for the columns that head an index:`UPDATE STATISTICS HIGH FOR TABLE t1(a); UPDATE STATISTICS HIGH FOR TABLE t1(f);`

These UPDATE STATISTICS HIGH statements replace the distributions created with the UPDATE STATISTICS MEDIUM statements with high distributions for index columns.

- Example of UPDATE STATISTICS HIGH statements for the first column in each index that differs:
For example, suppose you have the following indexes on table

**t1:**`CREATE INDEX ix_1 ON t1 (a, b, c, d) ... CREATE INDEX ix_2 ON t1 (a, b, e, f) ... CREATE INDEX ix_3 ON t1 (f) ...`

Step 3 executes UPDATE STATISTICS HIGH on column**a**and column**f**. Then run UPDATE STATISTICS HIGH on columns**c**and**e**.`UPDATE STATISTICS HIGH FOR TABLE t1(c); UPDATE STATISTICS HIGH FOR TABLE t1(e);`

In addition, you can run UPDATE STATISTICS HIGH
on column **b**, although this is usually not necessary.