Managing and Inspecting a Loading Job

This page describes the commands to check loading job status, abort a loading job, and restart a loading job.

Job ID and status

When a loading job starts, the GSQL server assigns it a job ID and displays it for the user to see. The job ID format is typically the name of the graph, followed by the machine alias, following by a code number, e.g., gsql_demo_m1.1525091090494

Example of SHOW LOADING STATUS output
Kick off the following job, i.e.
  JobName: load_test1, jobid: demo_graph_m1.1523663024967
  Loading log: '/home/tigergraph/tigergraph/logs/restpp/restpp_loader_logs/demo_graph/demo_graph_m1.1523663024967.log'

Job "demo_graph_m1.1523663024967" loading status

[RUNNING] m1 ( Finished: 3 / Total: 4 )
  [LOADING] /data/output/company.data
  [=============                        ]  20%, 200 kl/s
  [LOADED]
  +--------------------------------------------------------------------------------------+
  |                  FILENAME |   LINES |   OBJECTS |   ERRORS |   AVG SPEED |   DURATION|
  |/data/output/company.data |       6 |         3 |        2 |      60 l/s |     0.10 s|
  +--------------------------------------------------------------------------------------+
[RUNNING] m2 ( Finished: 1 / Total: 2 )
  [LOADING] /data/output/company.data
  [==========================           ]  60%, 200 kl/s
  [LOADED]
  +--------------------------------------------------------------------------------------+
  |                  FILENAME |   LINES |   OBJECTS |   ERRORS |   AVG SPEED |   DURATION|
  |/data/output/company.data |       6 |         3 |        2 |      60 l/s |     0.10 s|
  +--------------------------------------------------------------------------------------+

By default, an active loading job will display periodic updates of its progress. There are two ways to prevent these automatic output displays from appearing:

  • Run the loading job with the -noprint option.

  • After the loading job has started, enter CTRL+C. This will abort the output display process, but the loading job will continue.

SHOW LOADING STATUS

The command SHOW LOADING STATUS shows the current status of either a specified loading job or all current jobs:

SHOW LOADING JOB syntax
SHOW LOADING STATUS job_id|ALL
gsql

The display format is the same as that displayed during the periodic progress updates of the RUN LOADING JOB command. If you do not know the job ID, but you know the job name and possibly the machine, then the ALL option is a handy way to see a list of active job IDs.

ABORT LOADING JOB

The command ABORT LOADING JOB aborts either a specified load job or all active loading jobs:

ABORT LOADING JOB syntax
ABORT LOADING JOB job_id|ALL
gsql

The output will show a summary of aborted loading jobs.

ABORT LOADING JOB example
gsql -g demo_graph "abort loading job all"

Job "demo_graph_m1.1519111662589" loading status
[ABORT_SUCCESS] m1
[SUMMARY] Finished: 0 / Total: 2
  +--------------------------------------------------------------------------------------------------+
  |                  FILENAME |          LINES |      OBJECTS |     ERRORS |   AVG SPEED |   DURATION|
  | /home/tigergraph/data.csv |              6 |            3 |          2 |      60 l/s |     0.10 s|
  |/home/tigergraph/data1.csv |              6 |            3 |          2 |      60 l/s |     0.10 s|
  +--------------------------------------------------------------------------------------------------+

Job "demo_graph_m2.1519111662615" loading status
[ABORT_SUCCESS] m2
[SUMMARY] Finished: 0 / Total: 2
  +--------------------------------------------------------------------------------------------------+
  |                  FILENAME |          LINES |      OBJECTS |     ERRORS |   AVG SPEED |   DURATION|
  | /home/tigergraph/data.csv |              6 |            3 |          2 |      60 l/s |     0.10 s|
  |/home/tigergraph/data1.csv |              6 |            3 |          2 |      60 l/s |     0.10 s|
  +--------------------------------------------------------------------------------------------------+
gsql

RESUME LOADING JOB

The command RESUME LOADING JOB will restart a previously-run job which ended for some reason before completion.

RESUME LOADING JOB syntax
RESUME LOADING JOB job_id
gsql

If the job is finished, this command will do nothing. The RESUME command should pick up where the previous run ended; that is, it should not load the same data twice.

RESUME LOADING JOB example
gsql -g demo_graph "RESUME LOADING JOB demo_graph_m1.1519111662589"
[RESUME_SUCCESS] m1
[MESSAGE] The current job got resummed
gsql

Verifying and debugging a loading job

Every loading job creates a log file. When the job starts, GSQL displays the location of the log file.

This file contains the following information which most users will find useful:

  • A list of all the parameter and option settings for the loading job

  • A copy of the status information that is printed

  • A report on the number of lines successfully read and parsed

The report includes how many objects of each type are created, and how many lines are invalid due to different reasons. This report also shows which lines cause the errors.

There are two types of statistics shown in the report:

  • File-level: The number of lines

  • Data-object-level: The number of objects

If a file level error occurs, e.g., a line does not have enough columns, this line of data is skipped for all LOAD statements in this loading job. If an object level error or failed condition occurs, only the corresponding object is not created, i.e., all other objects in the same loading job are still created if there is no object level error or failed condition for each corresponding object.

File level statistics Explanation

Valid lines

The number of valid lines in the source file

Reject lines

The number of lines which are rejected by reject_line_rules

Invalid JSON format

The number of lines with invalid JSON format

Not enough token

The number of lines with missing column(s)

Oversize token

The number of lines with oversize token(s). Please increase OutputTokenBufferSize in the tigergraph/app/<VERSION_NUM>/dev/gdk/gsql/config file.

Object level statistics Explanation

Valid Object

The number of objects which have been loaded successfully

No ID found

The number of objects in which PRIMARY_ID is empty

Invalid Attributes

The number of invalid objects caused by wrong data format for the attribute type

Invalid primary id

The number of invalid objects caused by wrong data format for the PRIMARY_ID type

incorrect fixed binary length

The number of invalid objects caused by the mismatch of the length of the data to the type defined in the schema

Note that failing a WHERE clause is not necessarily a bad result. If the user’s intent for the WHERE clause is to select only certain lines, then it is natural for some lines to pass and some lines to fail:

CREATE VERTEX Movie (PRIMARY_ID id UINT, title STRING, country STRING, year UINT)
CREATE DIRECTED EDGE Sequel_Of (FROM Movie, TO Movie)
CREATE GRAPH Movie_Graph(*)
CREATE LOADING JOB load_movie FOR GRAPH Movie_Graph{
  DEFINE FILENAME f;
  LOAD f TO VERTEX Movie VALUES ($0, $1, $2, $3) WHERE to_int($3) < 2000;
}
RUN LOADING JOB load_movie USING f="movie.dat"
gsql
movie.dat
0,abc,USA,-1990
1,abc,CHN,1990
2,abc,CHN,1990
3,abc,FRA,2015
4,abc,FRA,2005
5,abc,USA,1990
6,abc,1990
gsql

The above loading job and data generate the following reports:

File level loading report
  [LOADED]
  +--------------------------------------------------------------------------------------+
  |                  FILENAME |   LINES |   OBJECTS |   ERRORS |   AVG SPEED |   DURATION|
  |/home/tigergraph/movie.dat |       6 |         3 |        2 |      60 l/s |     0.10 s|
  +--------------------------------------------------------------------------------------+
[WARNING] bad data in m1 (replica 1) /home/tigergraph/movie.dat: 1 line(s) do not have enough number of tokens.
[WARNING] bad data in m1 (replica 1) /home/tigergraph/movie.dat:Movie: 1 object(s) have invalid attributes.
Sampling error data can be viewed by executing the 'SHOW LOADING ERROR <job_id>'.
gsql

SHOW LOADING ERROR

Beginning with TigerGraph 4.1, the loader will log a sample of the data lines that it could not load. The command SHOW LOADING ERROR <job_id> presents a table of the sample of malformed data lines. Each row reports not only a data line and its location, but also its error type. This information helps users to quickly identify and understand errors, leading to quicker resolution of data format problems.

Error Data Example

+----------------+------------------------------+------------+--------------------------------------------------------+
|    Source      |            Reason            |   Offset   |                        Content                         |
+----------------+------------------------------+------------+--------------------------------------------------------+
|/home/tigergraph|InvalidAttribute(load vertex:M|1           |0,abc,USA,-1990                                         |
|/movie.dat      |ovie)                         |            |                                                        |
+----------------+------------------------------+------------+--------------------------------------------------------+
|/home/tigergraph|ShortOfToken(file)            |7           |6,abc,1990                                              |
|/movie.dat      |                              |            |                                                        |
+----------------+------------------------------+------------+--------------------------------------------------------+
Done
gsql

The command shows two line of error data: * The data at line 1, it has an invalid attribute(year). * The data at line 7, it doesn’t have enough tokens(should have 4 attributes). By default, only the earliest 100 errors are shown, but you can limit the number of data displayed by using the LIMIT option:

SHOW LOADING ERROR <job_id> LIMIT <num_of_lines>`
gsql
User privileges for showing loading errors are the same as for running loading jobs.

The file level report points out the two data level errors in this Movie loading job. The load_output log file contains more detailed information.

load_output.log (tail)
===============================================================================
Source File Name: /home/tigergraph/movie.dat
Lines with not enough tokens: 1 [ERROR] (e.g. Line 7:6,abc,1990)
Lines successfully tokenized: 6
	Vertex:Movie
		Lines failed condition: 2 (e.g. Line 4:3,abc,FRA,2015, Line 5:4,abc,FRA,2005)
	Lines passed condition: 4
			Invalid Attributes: 1 [ERROR] (e.g. Line 1:0,abc,USA,-1990)
			Valid Object: 3
gsql

There are a total of 7 data lines. The report shows that:

  • One line - Line 7 - does not have enough tokens.

  • Six of the lines are valid data lines and were successfully tokenized.

Of the 6 valid lines,

  • Three of the 6 valid lines generate valid movie vertices.

  • One line has an invalid attribute (Line 1: year).

  • Two lines (Lines 4 and 5) do not pass the WHERE clause.

The errors that appear are the first line with an error in each batch in the loading job. Up to 100 sample lines showing errors appear in the log.